Computer Science Notes - Operating Systems

An operating system is a program that acts as an intermediary between a user of the PC and the hardware.

System Components

A computer system can be abstracted into various components. Users (people, or other machines and systems) interact with system and application programs, which define the ways in which system resources are used to solve the users' computing problems. These in turn interact with the OS, which controls and co-ordinates the use of the hardware among the various application programs on behalf of the users. The OS interacts with the hardware - the most basic computing resources: CPU, memory, I/O, etc. When developing an OS, all levels must be considered, not just the OS level.

Users interact with the hardware (monitor, keyboard, mouse, etc) to control the system and application programs. A program needs an efficient interface to the OS so that it can use the resources required to accomplish the tasks the user wants and also communicate with the user. The OS interface describes the calls that an application program can make into the OS.

The OS is designed for ease of use, in the sense that users and programs should not have to deal with low-level hardware issues. This is more important than CPU usage - a user doesn't care if the CPU is idle 99% of the time. The goals of OS design are to provide a useful, simple and efficient system:

  • ease of use
  • minimise resource usage
  • maximise resource utilisation

However, abstraction away from the physical machine to provide ease of use can lead to inefficiencies. There's a trade off required between the extra functionality of the virtual machine and the overheads required to implement this.

History of Operating Systems

See CAR for Computer History.

1st Generation (1945-1955)

These are systems which didn't have operating systems. Users had complete access to the machine and the interface was basically raw hardware. Programming was accomplished by wiring up plugboards to control basic behaviour of the computer - the problems to be solved were normally numerical calculations.

Operating systems and programming languages were virtually unheard of.

2nd Generation (1955-1965)

These systems had minimal operating systems and were typically mainframe computers used to support commercial and scientific applications. Jobs and programs were written using punch cards and the computers were normally application specific.

The minimal OS was required to interact with I/O devices and a subroutine library existed to manage I/O interaction. The library is loaded into the top of memory and stays there. Application programs existed (compiler/card reader), but were limited.

Optimisations could be made by reducing the setup time through batching similar jobs. A small computer reads a number of jobs onto magnetic tape and the magtape is run on the mainframe, so a number of jobs are done in succession without wasting CPU time. These systems had a small permanently resident OS which had initial control and controlled automatic job sequencing.

These systems had a new memory layout and the processor had two modes - Normal, where some instructions (e.g., HALT) are forbidden, which was used by the batch programs, and Privileged, where all instructions are available, which was the level the OS ran in. Some CPUs also implemented protection in Normal mode to prevent the user program overwriting the first section of memory, in which the OS existed. Normal mode also has an instruction to call the OS (hand back control).

Batch systems were good for CPU bound scientific computing with minimal I/O, but bad for data processing with high I/O as the CPU is often idle. The speeds of the mechanical I/O devices are considerably slower than that of the electronic computer and jobs often have to pause when performing I/O. In some cases, the CPU was idle for >80% of the time.

However, with a permanently resident OS in memory, the OS now becomes an overhead, with less physical memory available for the application running at the time. Programming discipline in the application is still required, as the OS cannot protect against the application crashing the machine via bad I/O, infinite loops, etc...

3rd Generation (1965-1980)

The introduction of ICs led to minicomputers, which tried to merge the scientific computers with I/O computers and this introduced multiprogramming. Disk drives allowed jobs and programs to be stored on disk, allowing the computer to have access to many jobs. The CPU can now load many jobs into memory by giving each one a memory partition and then execute them by multiplexing between them.

In these OSs, the OS picks a job and then passes control to it. Eventually, the job either finishes or waits for some task (e.g., I/O) to complete. In a non-multiprogrammed system, the CPU would now be idle. In a multiprogrammed system, the OS switches the CPU to another job.

Eventually, the operation the original job was waiting for completes, enabling that job to become eligible for CPU time again. As long as there is at least one job waiting to execute, the CPU is not idle.

When a job completes, the OS can load another job into that memory slot from disk. Reading jobs onto disk while others execute is called spooling - Simultaneous Peripheral Operation On-Line - and the same idea is still used today for operations such as print queues.

Multiprogramming OSs are quite complex - they now handle quite complex I/O devices (disks, etc) and swapping between applications when suspended on I/O.

Multiprogramming works best when there are a small number of large jobs - this reduces the overhead of the OS due to managing memory between fewer jobs. This method also encourages a fairer overall use of the computer.

Multiprogramming systems led on to time-sharing/interactive systems. The CPU here is multiplexed between several jobs that are kept in memory and on disk. The CPU is allocated to a job only if the job is in memory. When a job waits on I/O, the OS switches the CPU to another job. Jobs can also be switched between memory and the disk. On-line communication between the user and the system is provided - when the OS finishes the execution of one command, it seeks the next "control statement" from the user's keyboard.

In time-share systems, the user has the illusion of using the computer continuously and the user can now interact directly with the computer whilst it is running. This encourages smaller jobs and more efficient use of the computer. Additionally, many users can share the computer simultaneously, submitting jobs/commands directly via keyboards/terminals, etc.

The first time-sharing system was CTSS, developed by MIT. This formed the basis for MULTICS, one of the first commercial time-sharing OSs, which had only limited success. UNIX was then developed at Bell Labs, written in the high-level language C, which was based on B.

4th Generation (1980-present)

The fourth generation of operating systems brought about the first mass-user operating systems, as LSI and VLSI enabled low-cost computing, with mini and desktop computers becoming common.

The age of mass computing is based around desktop computers. The desktop PC is dedicated to a single user, and the distinction between desktop computers and minicomputers has become increasingly blurred. Many minicomputers are now merely high-end desktop PCs with advanced I/O capabilities and multiprocessors.

The multiprogrammed OS (UNIX, VMS, etc) was developed further for mass-user computers, adding functionality like massive file storage, fast networking, complex application programs and the potential for multiprocessing. Modern desktop systems have dedicated single-user OSs (e.g., CP/M, DOS, MS-DOS, Mac OS, Windows) with complex I/O devices (keyboards, mice, display screens, small printers, graphics tablets, etc) and complex graphical applications, with user convenience and responsiveness considered of high importance.

An increased blurring between desktop computers and minicomputers has led to minicomputer OSs becoming available for desktop, e.g., Linux - based on the BSD and SysV variants of UNIX.

Much of the current OS trends are based on the relative cost of computers and people. In the beginning, computers were few, with very expensive hardware, so the people cost was cheap compared to the computer cost; therefore the goal was to maximise hardware utilisation. Now we have an enormous number of computers with cheap hardware, and the human cost is expensive compared to the hardware cost. The goal now is ease of use.

Current OS trends are towards specialist requirements - moving away from the general purpose OS for specialist needs, or tuning existing general purpose OSs. Real-time OSs have been introduced, which require low overheads and highly efficient resource usage. Multi-processor machines are becoming more common and OSs need to be updated to handle the extra hardware. On the other hand, embedded devices have now become very common, and these have limited power, space and weight requirements.

Role of the Operating System

Resource Sharing

The OS contains a set of algorithms that allocates resources to the programs executed on behalf of the user. These resources include time, power, hardware, etc...

Control Program

The control program controls the operation of the application programs to prevent errors affecting other programs.

Provision of a Virtual Machine

This hides interfaces to I/O devices, filing systems, etc, and provides a programming interface for applications.

Kernel

The kernel is the only program resident all the time (all other applications are application programs).

The modern general purpose OS contains various components to accomplish these roles, such as:

  • Process management
  • Main memory management
  • File management
  • I/O system management
  • Secondary storage management
  • Protection system
  • Command interpreter system

All but the last two in the above list can be considered to be resource management.

Processes

A program is an algorithm expressed in some notation. A process is the activity of executing the program.

Early systems only allowed one program to be executed at a time. Later systems allowed multiprogramming, where many programs are loaded into memory at once and only one can be executed at any time. Programs are termed jobs or processes when executing. A process is sequential and can only do one thing at a time.

Multiple user processes run in separate (protected) address spaces, and can't directly access other processes' memory. Some user programs can share address space (e.g., an interpreter), but only the code (which is read only) is shared. The kernel can access everything, however.

As the CPU only has one program counter, the OS must use a mechanism known as process switching.

Process Creation

Processes are created by one of four principal events:

  1. System initialisation (typically the OS and system programs)
  2. A process-creation call made by an already running process (either the kernel or a user program)
  3. A user request to create a new process (normally by interacting with some kind of shell)
  4. Initiation of a batch job (typically only in batch systems)

When a new process is created, the OS must assign memory and create a task/process control block (TCB/PCB) which holds various data about that process (a sketch of such a structure follows the list), such as:

  • Process state
  • Unique program ID
  • Program counter
  • CPU registers
  • Memory information (this, along with the program counter and CPU registers forms the process context)
  • I/O status information (e.g., open files)
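A minimal sketch of what such a control block might look like as a C structure is given below; the field names, sizes and types are illustrative assumptions, not taken from any particular OS.

    /* Illustrative sketch of a process/task control block (PCB/TCB).
       Field names and sizes are assumptions for illustration only. */
    enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

    struct pcb {
        int             pid;             /* unique process ID                        */
        enum proc_state state;           /* current process state                    */
        unsigned long   program_counter; /* saved PC (part of the process context)   */
        unsigned long   registers[16];   /* saved CPU registers (process context)    */
        void           *page_table;      /* memory-management information            */
        int             open_files[16];  /* I/O status, e.g. open file descriptors   */
        struct pcb     *next;            /* link used by scheduling queues           */
    };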

Process Hierarchies

When a parent creates a child process, the child process can create its own child processes, and we get a hierarchy (or tree) of processes. UNIX terms this as a process group, whereas Windows has no such concept, and all processes are created equal.

Process State

In single processor systems, only one process can be in the running state at any time. If all processes are waiting (or blocked), then the processor is idle until one of the processes becomes unblocked, or ready.

The states possible are:

  • new: The process is being created
  • running: Instructions are being executed
  • waiting: The process is waiting for some event to occur
  • ready: The process is waiting to be run
  • terminated: The process has finished execution

The switch between processes is called context switching - where the context of the currently running program is saved in the process context of the PCB and then the process context from the new PCB is loaded into the CPU. The context switch time is overhead as the CPU does nothing useful whilst switching. The time it takes for a switch is dependent on the hardware support available.

Process Termination

There are various conditions that can terminate processes:

  1. Normal exit (voluntary)
  2. Exit on an error condition (voluntary)
  3. Fatal error (involuntary)
  4. Killed by another process (involuntary)

When a process is terminated, all resources used by the process are reclaimed by the OS, with the PCB reclaimed last (as the PCB contains information on where the resources to be reclaimed are).

System Calls

System calls provide an interface between the running process and the OS. They are invoked via an assembly-level instruction, normally some kind of software interrupt (e.g., INT 0x80 on x86 Linux), which traps into the OS, which then reacts to the requested system call appropriately.

Parameters are passed to system calls, just as with procedure calls. Normally registers, blocks (tables) in memory or stacks are used; in modern processors, tables are the most popular. Another method is to use instructions that force an address space change without using an interrupt, but the effect is the same.

System calls normally map onto the services provided by the OS.
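As a concrete illustration, the sketch below shows two ways a C program on Linux can reach the OS's write service: through the C library wrapper write(), and through the generic syscall() interface with the SYS_write number. This is a minimal sketch assuming a Linux/POSIX system.

    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        const char msg[] = "hello via the OS\n";

        /* 1. Through the C library wrapper, which issues the trap for us. */
        write(1, msg, sizeof(msg) - 1);

        /* 2. Through the generic syscall() interface, naming the service
              (SYS_write) and passing the parameters explicitly. */
        syscall(SYS_write, 1, msg, sizeof(msg) - 1);

        return 0;
    }

Both calls end up executing the same kernel service; the wrapper simply hides the trap instruction and the parameter passing.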

POSIX is an IEEE standard based on the UNIX interface. It is not an OS, merely a standard API, which is quite loosely specified and covers almost all system calls you would ever have to make. Some aspects of most major OSs are POSIX compliant, but there exists no complete implementation of the POSIX standard. Linux is primarily POSIX compliant for the core calls.

Threads

Traditional (heavyweight) processes have a single control path through their code. Many applications are concurrent, but do not want the overhead of multiple processes and inter-process communication (e.g., a word processor has UI, reading input, a background spell checker, etc).

Many OSs provide threads (sometimes called lightweight processes), a basic unit of CPU use. Each is a separate control path through the code, and threads within a process are essentially independent. Threads can access all of the address space of the process, and threads have no protection against each other (therefore, you should program threads safely).

Single and Multi-Threaded Processes

Each thread comprises a thread ID, program counter, register set and a stack. A thread shares with the other threads belonging to the same process the code section, the data section and OS resources such as open files. Each thread needs context, but not as much as the process needs, as it inherits some of the context of the process.

Threading has various benefits, such as responsiveness - other threads can continue if one blocks (e.g., on I/O); resource benefits - there's only one copy of the code and TCB, etc; economy - thread creation and management is less costly than that of processes; and better utilisation of multi-processor architectures (threads can run in parallel).
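A minimal POSIX threads (pthreads) sketch is shown below, assuming a POSIX system; it creates two threads in one process that share the same global variable, illustrating the shared address space (and hinting at the safety issues discussed later).

    #include <pthread.h>
    #include <stdio.h>

    static int shared_counter = 0;      /* shared by all threads in the process */

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        shared_counter++;               /* unsynchronised on purpose - threads have
                                           no protection against each other */
        printf("thread %d ran, counter = %d\n", id, shared_counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;

        pthread_create(&t1, NULL, worker, &id1);
        pthread_create(&t2, NULL, worker, &id2);

        pthread_join(t1, NULL);         /* wait for both threads to finish */
        pthread_join(t2, NULL);
        return 0;
    }

(Compile with -lpthread on most systems.)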

User Threads

Implemented by user-level libraries, these provide threads to an application without the support of the underlying OS. Here, the application handles its own threading, so the OS only sees one process. The application has a thread table, used in the application for keeping track of the thread state and context, in a similar way to how the kernel handles TCBs for processes. The state diagram for a thread looks similar to the one for a process.

User threads are usually fast and have low management overheads (as a library, the threads library is within the address space of the process, so calls to the threads library functions are fast). The application can also define its own management (i.e., scheduling) policy, based on better information about its threads than a more general policy would have.

This method does have disadvantages, however. When a thread calls I/O, the whole process is blocked, as the OS sees the whole process as waiting for I/O. When an application has threads that are likely to block often, then user threads may not be the best option.

Kernel Threads

Kernel threads are implemented at the OS level, and therefore have higher overheads (a thread control space/table is now needed in the TCB) and are slower, due to syscalls. With kernel threads, the OS schedules all threads, so it can schedule appropriately (e.g., when one thread blocks, only move that thread into waiting instead of the whole process). Threads can also be scheduled over multiple processors, so we have parallelism instead of just concurrency (which can introduce its own set of problems).

Hybrid Threading

Both user and kernel threading have their advantages and disadvantages, so hybrid methods give you the best of both worlds. Most real implementations of threading are hybrid.

A common approach is to multiplex a number of user threads onto a number of kernel threads; each kernel thread has some set of user threads that take turns using the kernel thread. There is usually a maximum number of kernel threads per user process.

Many-to-One

This is essentially the same as normal user-level threading: many user-level threads are mapped to a single kernel thread. All thread management happens in user space, so this is efficient, but it does suffer from the problem of one blocking thread blocking the whole process.

Alternatively, the single kernel thread could be implemented as a process, which is the approach used on systems that don't support kernel threads.

One-to-One

All threads here are kernel threads - each user thread maps directly on to one kernel thread. This has more concurrency than the many-to-one model, as when one thread blocks another can run. This does have the overhead of having to create a kernel thread for each user thread, however.

Many-to-Many

This is the normal method of implementing threading. Here, the user library controls all - which threads are in the kernel, which are not and also controls swapping, etc. The library tries to control the threads so that not all of the kernel threads end up blocked at the same time, by swapping in and out waiting user threads to kernel threads.

This has the advantages of there being control over concurrency in the application, and the application can allocate user threads to kernel threads in a way to optimise the concurrency and ensure that at least one user thread is runnable at any time.

This method is quite complex, however, and there are overheads in both the OS and the application. The user thread scheduling policy may also conflict with the kernel thread scheduling policy, introducing inefficiencies.

Threading Issues

fork() creates a separate duplicate process and exec() replaces the entire process with the program specified by its parameter. However, we need to consider what happens in a multi-threaded process. exec() works in the same manner - replacing the entire process including any threads (kernel threads are reclaimed by the kernel), but if a thread calls fork(), should all threads be duplicated, or is the new process single threaded?

Some UNIX systems work around this by having two versions of fork(), a fork() that duplicates all threads and a fork that duplicates only the calling thread, and usage of these two versions depends entirely on the application. If exec() is called immediately after a fork(), then duplicating all threads is not necessary, but if exec() is not called, all threads are copied (an OS overhead is implied to copy all the kernel threads as well).
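The usual UNIX pattern is sketched below: fork() followed immediately by exec() in the child, in which case duplicating all of the parent's threads would have been wasted work. The program run by exec ("/bin/ls") is just an arbitrary example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();              /* duplicate the calling process */

        if (pid == 0) {
            /* child: replace the whole process image straight away */
            execl("/bin/ls", "ls", "-l", (char *)NULL);
            perror("execl");             /* only reached if exec fails */
            exit(1);
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);       /* parent waits for the child */
        } else {
            perror("fork");
        }
        return 0;
    }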

Thread Cancellation

Cancellation is termination of the thread before it is complete (e.g., if multiple threads are searching a database and one finds an answer, the others can stop).

There are two types of cancellation: asynchronous and deferred (or synchronous). With asynchronous cancellation, the calling thread can immediately terminate the target thread, but in deferred cancellation the target thread is informed that it is to be terminated by setting some status in the target thread. When the target thread consults its status, it finds out it is to terminate, performs some clean-up and terminates itself. This requires some co-ordination between threads.

Asynchronous cancellation is useful if the target thread does not respond; otherwise it should not be used, as it can lead to inconsistencies.

Signal Handling

Signals are used in UNIX to notify a process that a particular event (often an error) has occurred. A signal is normally generated by an event and is delivered to the process. Once delivered, the signal must be handled.

Synchronous signals are delivered to the same process that performs the operation that caused the signal, e.g., divide by 0 errors, etc...

Asynchronous signals are delivered to running processes and are generated by something external to the process, e.g., a key press.

The signal can be handled by a default signal handler, or a user-defined signal handler. User-defined overrides the default handler (in most cases).
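The sketch below shows how a user-defined handler can be installed to override the default action for SIGINT (the asynchronous signal generated by Ctrl-C), assuming a POSIX system.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    static void on_sigint(int signo)
    {
        (void)signo;
        got_signal = 1;              /* do as little as possible inside a handler */
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = on_sigint;   /* user-defined handler overrides the default */
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGINT, &sa, NULL);

        while (!got_signal)
            pause();                 /* sleep until a signal is delivered */

        printf("caught SIGINT, exiting cleanly\n");
        return 0;
    }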

In multi-threaded systems, we have to consider to which thread the signal should be delivered:

  • To the thread where the signal applies (how do you know this?)
  • To every thread (has a lot of overhead)
  • To certain threads (which ones?)
  • To a designated thread, which handles all signals (has the problem of there being the overhead of another thread, and the thread may not be scheduled in the best way to handle the signal)

CPU Scheduling

CPU scheduling refers to the assignment of the CPU amongst processes that are ready. It selects from among the processes in memory that are ready to execute and allocates the CPU to one of them. In terms of process state, the scheduler decides which of the processes that are in the ready state should be allocated to the CPU and enter the running state.

The CPU scheduling decisions may take place when a process:

  1. switches from the running to the waiting state
  2. switches from the running to the ready state
  3. switches from the waiting to the ready state
  4. terminates

A scheduling decision may also take place as the result of a hardware interrupt.

We call scheduling under conditions 1 and 4 nonpreemptive - once the program is running, it only gives up the CPU voluntarily. Scheduling under conditions 2 and 3 is preemptive - that is, the running process can be preempted by an interrupt. Problems may arise when a program is preempted whilst accessing shared data, which gives us a need for mutual exclusion.

Dispatcher

The dispatcher module gives control of the CPU to the process selected by the scheduler. This involves:

  • switching context
  • switching to user mode
  • jumping to the proper location in the user program to restart that program

With the dispatcher, we have a new metric, dispatch latency, which is the time it takes for the dispatcher to stop one process and start another running. Dispatch latency is optimal when it is minimised.

The dispatcher is an example of a mechanism - mechanisms determine how scheduling is implemented, while policies decide what will be done. The separation of policies and mechanisms is an important OS principle, allowing flexibility if policy decisions are to be changed later.

Process Scheduling Queues

This is based upon the states in the process state transition graph. The OS needs to find the TCBs of all processes efficiently, and needs to find a process waiting on a particular event quickly when that event occurs. Specifically, the OS needs to find the next process to dispatch (i.e., from the ready queue) quickly on a context switch. You also need to be able to move processes between queues efficiently.

A process is only in one queue at a time. For example, you may have a ready queue, which is the set of all processes residing in main memory, ready to execute, and also a number of waiting queues, one for each different I/O device, holding the set of processes waiting on I/O from that device. Processes move between queues as they change state. Pictorially, this can be drawn as a queueing diagram of the ready queue and the device queues.

With this model, we have two types of schedulers: the short-term scheduler (or CPU scheduler), which selects which process should be executed next from the ready queue and allocates the CPU. It is invoked very frequently (every 100 or so milliseconds), and must be very fast. Additionally, we have the long-term scheduler (or job scheduler). This selects which new processes should be brought into the ready queue, and is invoked very infrequently (in terms of seconds and perhaps even minutes). The long-term scheduler controls the degree of multi-programming (i.e., the number of processes in memory). However, how do we decide which jobs should be brought into main memory?

Processes can be described as being either I/O-bound (spends more time doing I/O than computations, and has many short CPU bursts) or CPU-bound (spends more time doing computation, has a few very long CPU bursts). The long term scheduler tries to maintain a balance between these two forms.

Some OSs also have a medium term scheduler, which swaps in/out partially executed processes.

Scheduling Policy

When the OS needs to select another process to execute, the scheduling policy decides which of the processes to assign to the CPU. Scheduling policy effectively determines the order of the ready queue so that the next process to run is at the head of the queue.

Scheduling policies are often defined on the CPU-I/O burst cycle. Most process executions consist of a cycle of CPU execution and I/O wait. The characteristics of this cycle help determine policies. The relationship between the relative sizes of CPU and I/O bursts is very important; I/O bursts are very long compared to CPU bursts in most applications. A histogram of CPU-burst times typically shows many short CPU bursts and a few long CPU bursts, the latter belonging to the CPU-bound processes.

Some of the goals of scheduling policies are:

  • CPU utilisation high (keeping the CPU as busy as possible)
  • Throughput high (the number of processes that complete their execution per time unit)
  • Turnaround time low (the amount of time it takes to execute a particular process)
  • Waiting time low (amount of time a process has been in the waiting queue)
  • Response time low (amount of time it takes from when a request was submitted until the first response is produced, not output - for time-sharing environments)

Turnaround time alone is a bad goal to optimise towards, as overall processes will take longer.

First-Come, First-Served Scheduling

In First-Come, First-Served (FCFS) scheduling, processes are scheduled in the order that they arrive in the ready queue.

The average waiting time for FCFS is not optimal, however, as it suffers from the convoy effect, where short processes are delayed behind long processes.
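As an illustrative example (the burst lengths are made up): suppose processes P1, P2 and P3 have CPU bursts of 24 ms, 3 ms and 3 ms and arrive in that order. P1 waits 0 ms, P2 waits 24 ms and P3 waits 27 ms, giving an average waiting time of 17 ms. If the order were P2, P3, P1, the waiting times would be 0 ms, 3 ms and 6 ms, an average of just 3 ms - the short jobs stuck behind P1 in the first ordering illustrate the convoy effect.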

Shortest-Job-First Scheduling

Shortest-Job-First (SJF) scheduling works by associating each process with the length of its next CPU burst, and then using these lengths to schedule the process with the shortest time next.

This has two schemes: non-preemptive (once the CPU is given to the process, it cannot be preempted until it completes its CPU burst); and preemptive (if a new process arrives with CPU burst length less than the remaining time of the current executing process, preempt), which is sometimes known as Shortest-Remaining-Time-First (SRTF) .

SJF is optimal in that it gives the minimum average waiting time for a given set of processes.

Determining the length of the next CPU burst can only be done by estimating it. One way of accomplishing this is by using the lengths of previous CPU bursts and exponential averaging with the formula τ(n+1) = α·t(n) + (1 - α)·τ(n), where t(n) is the actual length of the nth CPU burst, τ(n+1) is the predicted value for the next CPU burst and 0 ≤ α ≤ 1.
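A minimal sketch of this calculation in C follows (the function name and the choice of α = 0.5 in the example are illustrative assumptions):

    /* tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n) */
    double next_burst_prediction(double alpha, double actual_burst, double prev_prediction)
    {
        return alpha * actual_burst + (1.0 - alpha) * prev_prediction;
    }

For example, with α = 0.5, a previous prediction of 10 ms and an actual burst of 6 ms, the next prediction is 0.5 × 6 + 0.5 × 10 = 8 ms, so the estimate moves part-way towards recent behaviour while still remembering history.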

Priority Scheduling

A priority number (integer) is associated with each process and the CPU is allocated to the process with the highest priority (the smallest integer). This can be either preemptive or non-preemptive. SJF is a priority scheduling policy if the priority is predicted next CPU burst time.

This does have the problem of starvation, where low priority processes may never execute, and the solution to this is aging, where as time progresses, the priority of the process is increased.

Round Robin

Each process gets a small unit of CPU time (a time quantum), which is usually about 10-100 ms. After this time has elapsed, the process is preempted and added to the end of the ready queue. A smaller time quantum means more context switches and therefore more overhead, so the quantum needs to be set to some appropriate value that is large with respect to the cost of a context switch, but not so high as to effectively become a FIFO policy. If there are n processes in the ready queue and the time quantum is q, then each process gets 1/n of the CPU time in chunks of at most q time units at once. No process waits more than (n - 1)q time units.
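Using the same illustrative bursts as the FCFS example above (P1 = 24 ms, P2 = 3 ms, P3 = 3 ms) with q = 4 ms: the CPU runs P1 for 0-4, P2 for 4-7 (it finishes), P3 for 7-10 (it finishes), then P1 for the remaining 10-30. The waiting times are 6 ms, 4 ms and 7 ms respectively, an average of about 5.7 ms - much better for the short jobs than FCFS, at the cost of extra context switches.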

Average turnaround time does not necessarily improve as the quantum increases, so quantum size is a trade-off between good turnaround and low overheads.

Multi-level Scheduling

Here, the ready queue is partitioned into separate queues, e.g., a foreground queue (for interactive processes) and a background queue (for batch processes). Each queue has its own scheduling algorithm; for example, the foreground one may be RR and the background one may be FCFS.

Each queue must also be scheduled for CPU time. For example, we could have a fixed priority scheduling algorithm (serve all from foreground, then from background), but this has the possibility of starvation, or we could have a time-slice, where each queue gets a certain amount of CPU time which it can schedule among its processes, e.g., 80% to foreground and 20% to background.

Multi-level Feedback Queue

A process can move between the various queues in a multi-level scheduling policy, and this is a way of implementing aging. The multilevel-feedback-queue scheduler is defined by the following parameters:

  • number of queues
  • scheduling algorithms for each queue
  • method used to determine when to upgrade a process
  • method used to determine when to downgrade a process
  • method used to determine which queue a process will enter when that process needs service

Thread Scheduling

For user-level threads, any general purpose scheduling policy can be used according to the best policy for that application. General user-level thread implementations should allow the application sufficient freedom to define an application-specific scheduling regime, i.e., allow the application to reorder the thread scheduling queues at any time, with any policy.

Memory Management

Processes (including threads) need physical memory in order to execute. The OS manages memory and allocates memory to processes from physical memory and virtual memory. Physical memory management enables the OS to safely share physical memory among processes. This allows us to multiprogram the processor, to time-share the system (overlap computation and I/O) and to prevent one process from using another process's memory.

Memory management techniques must meet the following requirements:

  • Processes may require more memory than available physical memory
  • You may need to load processes to different locations each time they execute
  • You may need to swap all or part of a process to secondary storage during execution (and then reload it at a different memory location, as above)
  • Memory protection between processes is required

Some simple memory management techniques include loading the process into memory and switching between them with no protection between the processes (DOS), or copying the entire process memory to secondary storage when it performs I/O and then copy it back when it is ready to restart (early UNIX). This is clearly inefficient.

Most modern OSs work by giving each process a part of memory that it is allowed to access and check each memory access to ensure that the process stays within its own memory.

Physical Memory Allocation

Main memory must accommodate both the OS and user processes. Usually, main memory is divided into a number of distinct partitions, such as the resident OS, which is normally a single partition held in low memory, as this is where CPUs place interrupt vectors. User processes normally occupy one or more partitions in memory above the OS. However, we need to consider how we're going to divide up these user partitions.

The simplest scheme is to partition memory into regions with fixed boundaries - this is called fixed partitioning and has two variants, where the partitions are of equal size, or where they are of unequal size. Any process whose size is less than or equal to the partition size can be loaded into an available partition. If all the partitions are full, the operating system can swap a process out of a partition. However, if a process doesn't fit into a partition, it is the programmer's responsibility to make sure that the program can be split over multiple partitions. With this, main memory use is inefficient. Any program, no matter how small, occupies an entire partition. This is called internal fragmentation.

With unequal partitions, the problem is reduced, but not eliminated. Instead of creating x partitions of equal size, partitions of varying sizes are created, so a program can be loaded into the one most appropriate to its size.

Another method is dynamic partitioning, where the partitions are of variable length and number and are created when the process asks for them - so, the process is allocated exactly as much memory as it requires, eliminating internal fragmentation. It does however leave holes in memory, as processes are swapped in and out in different places, leaving small gaps between processes. This is called external fragmentation and could lead to a situation where there is enough free memory for a process to run, but it is fragmented across main memory. The way to solve this is using compaction - rearranging the programs in memory so they are contiguous and all free memory is in one block. This is very, very slow.

There are different ways of implementing allocation of partitions from a list of free holes, such as:

  • first-fit: allocate the first hole that is big enough
  • best-fit: allocate the smallest hole that is big enough; the entire list of holes must be searched, unless it is ordered by size
  • next-fit: scan holes from the location of the last allocation and choose the next available block that is large enough (can be implemented using a circular linked list)

First-fit is usually the best and fastest, but analysis has shown that under first-fit, fragmentation can cause up to half of the available memory to be lost in holes (external fragmentation) or unused memory (internal fragmentation). Best-fit has the smallest amount of external fragmentation. Next-fit can produce external fragmentation similar to best or first, depending upon the next block found. It does use holes at the end of memory more frequently than others perhaps leading to the largest blocks of contiguous memory.

A common approach, especially in embedded systems, is the buddy system. Under the buddy system, the entire space is treated as a single block of size 2^U. If a process requests memory of size s, such that 2^(U-1) < s ≤ 2^U, the entire block is allocated; otherwise the block is split into two equal buddies. This process continues until the smallest block greater than or equal to s is generated.

At any time, the buddy system maintains a list of holes of each size 2^i. A hole on the (i+1) list can be split into two equal holes of size 2^i, which are placed on the i list.
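As an illustrative example (sizes made up): with a 1 MiB initial block, a request for 100 KiB is rounded up to 128 KiB, the smallest power of two that is big enough. The 1 MiB block is split into two 512 KiB buddies, one 512 KiB into two 256 KiB, and one 256 KiB into two 128 KiB; one 128 KiB block is allocated, leaving holes of 128 KiB, 256 KiB and 512 KiB on the free lists. When the block is freed it can be coalesced with its 128 KiB buddy (if that is also free), and so on back up the sizes. A sketch of the rounding step in C follows (the 4 KiB minimum block size is an assumption):

    #include <stddef.h>

    /* Round a request up to the smallest power-of-two block size,
       as the buddy system does. Minimum block size assumed to be 4 KiB. */
    size_t buddy_block_size(size_t request)
    {
        size_t block = 4096;
        while (block < request)
            block <<= 1;            /* keep doubling until block >= request */
        return block;
    }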

Implementing Physical Memory Management

There are two main problems in implementing physical memory management - how can processes be run in different areas of memory, given that they contain lots of addresses, and how can the OS provide memory protection.

Memory addresses in a program must be translated into physical memory addresses, so programs are created to be relocatable, such as the ELF binaries for Linux. There are three points where binding of program address to memories can occur:

  • compile time - if the memory location is known at compile time, static absolute addresses can be compiled in; if the starting address changes, the code must be recompiled.
  • load time - here, the code generated is relocatable and the code can be loaded at any address and the final binding occurs when the program is loaded. If the memory locations change, the code just needs to be reloaded.
  • execution time - the binding is delayed until runtime if the process can be moved from one memory segment to another during execution. This method does require hardware support.

To work around this, we have the concept of a logical address space that is bound to a separate physical address space, which is the key to memory management. Logical addresses are the ones generated by the CPU, and physical addresses are the ones seen by the memory unit. When using compile-time or load-time address-binding schemes, the logical and physical addresses are the same, but with the execution-time address-binding scheme, the logical (virtual) and physical addresses differ. This is accomplished using a memory management unit.

The memory management unit (MMU) is a hardware device that maps a logical address to a physical one, so the user program only deals with logical addresses, and never sees the physical one.

The most simple form of MMU is the relocation register. The relocation register holds the value of an offset and then adds the offset to all logical addresses given to it to give the physical address. In this system, logical addresses are in the range 0..max for all processes and physical addresses are in the range R..R+max, where R is the value in the relocation register. The process only needs to generate the logical addresses, and only sees the process running in addresses 0..max. This principle of mapping is central to memory management.

The relocation register can provide memory protection with the addition of a limit register, which contains the allowed range of logical addresses - each logical address must be less than the limit register. Hardware support is obviously required for this.
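A sketch of the check and mapping performed by such an MMU is shown below; in hardware this happens on every memory access, and the error path here just stands in for the addressing-error trap to the OS.

    #include <stdio.h>
    #include <stdlib.h>

    /* Map a logical address using a relocation register R and a limit register. */
    unsigned long relocate(unsigned long logical,
                           unsigned long relocation,  /* value R */
                           unsigned long limit)       /* logical addresses must be < limit */
    {
        if (logical >= limit) {
            fprintf(stderr, "addressing error: trap to OS\n");
            exit(1);
        }
        return relocation + logical;   /* physical address in the range R..R+max */
    }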

Another way of thinking of this is not thinking of a program as one large chunk of contiguous memory - a process may be broken up into pieces that do not need to be loaded into main memory all the time. This increases opportunities for multiprogramming as with many processes in main memory, it is more likely a process will be in the ready state at any particular point. This is achieved by programming techniques with minimal OS support.

Static vs. Dynamic Loading

So far, we have assumed that the entire program code and data must be loaded into memory for a process to execute - this is called static loading. An alternative approach, called dynamic loading, works by not loading a routine until it is called. This gives better memory-space utilisation and unused routines are never loaded. This is useful when large amounts of code are required to handle infrequently occurring cases. If dynamic loading is implemented through program design, no special support from the OS is required.

Dynamic loading enables a small part of the process to be loaded initially - usually around the program entry point, with part of the data - and allows a greater degree of multiprogramming, as only parts of each program need to be in physical memory at any time. It is usually the responsibility of the application to ensure that the appropriate parts of the application are loaded at any time.

Some OSs do provide libraries to aid loading parts of a program.

Overlays

These are slightly more structured: code is put into phases that are loaded over one another. Again, the motivation is to keep only the code and data in memory that are actually needed. When one phase completes, the application arranges to overlay its memory with the next phase when required. The total memory required is the same as the memory required by the largest phase. Some OS support may be given to implement this.

This is useful when there is a limited amount of physical RAM, and some secondary storage is available to hold code/data, e.g., in embedded systems.

Shared Libraries

Shared libraries can be implemented in two ways - using static or dynamic linking. In static linking, all libraries required by the program are linked into the image (i.e., form part of the data/code of the program), so the memory allocated must be big enough to hold the application and the libraries.

With dynamic linking, the libraries required by the program are dynamically linked at execution time (just-in-time), so the memory allocated must be big enough to hold the application. Shared libraries are allocated from a different area of memory.

Swapping

A process (or part of a process) can be swapped temporarily out of memory to a backing store (e.g., disk) and then brought back into memory for continued execution. This is what the medium-term scheduler does. If you want to reload at a different address, execution-time binding is required.

The issue with swapping is the backing store - a fast disk large enough to accommodate copies of all memory images for all users, which must provide direct access to these memory images, is required. A major part of the swap time overhead is the transfer time. The total transfer time is directly proportional to the amount of memory swapped.

One of the more common swapping variants is called roll-out, roll-in and it is used with priority-based scheduling algorithms. Lower-priority processes are swapped out so higher-priority processes can be loaded and executed.

Many different varieties of swapping can be found on many OSs, such as UNIX, Linux, Windows, etc...

Paging

The problems with using physical memory allocation with unequal, fixed or variable sized partitions are fragmentation and inefficiencies (algorithms to track holes, and compaction). Paging and segmentation are memory allocation techniques that largely eliminate these problems.

Many of these problems arise as memory allocation is of contiguous blocks of memory to processes. Paging removes this restriction, as the logical address space of a process can be noncontiguous. The physical memory space is divided into fixed-size blocks called frames, and logical memory is divided into blocks of the same size called pages. Pages are then allocated and fit into frames.

To implement paging, we need to keep track of all free frames. To run a program of size n pages, we need to find n free frames and load the program into them. There is a problem with internal fragmentation if a process size is not an exact number of frames. Only part of one frame is lost to this fragmentation, however, and if we make the frame size quite small, this isn't a significant problem.

With paging, the logical address generated by the CPU is divided into a page number p and a page offset d. The page number is used as an index into a page table which contains the base address of each page in physical memory and the page offset is combined with the base address to define the physical memory address that is sent to the memory unit.
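A sketch of this translation for a single-level page table is given below, assuming for illustration a 32-bit logical address and a 4 KiB page size (so the offset is the low 12 bits):

    #include <stdint.h>

    #define OFFSET_BITS 12
    #define PAGE_SIZE   (1u << OFFSET_BITS)      /* 4096 bytes */

    /* page_table[p] holds the frame number for page p */
    uint32_t translate_page(uint32_t logical, const uint32_t *page_table)
    {
        uint32_t p = logical >> OFFSET_BITS;     /* page number p */
        uint32_t d = logical & (PAGE_SIZE - 1);  /* page offset d */
        uint32_t frame = page_table[p];          /* base frame from the page table */
        return (frame << OFFSET_BITS) | d;       /* physical address */
    }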

Usually, page tables are a fixed length, often the same size as a frame. If a process does not require all entries in the page table, the redundant entries are set to be invalid. If a reference is made to an invalid page table entry, the OS must handle a page fault and there is the potential for an erroneous logical address to have been generated by the application process.

The page table is normally kept in main memory, so in this scheme every data/instruction access requires two memory lookups, one for the page table and then one for the actual data or instruction. The two memory access problem can be solved using a special piece of hardware called a translation look-aside buffer (TLB).

TLBs are implemented using associative memory, which allows for a parallel search of page numbers in order to get the frame number. The TLB is not a part of normal physical memory.

Memory protection in paging is implemented by associating a protection bit with each frame. The valid-invalid bit is attached to each entry in the page table, where "valid" indicates that the associated page is in a process' legal address space and is thus a legal page, whereas "invalid" indicates that the page is not in the legal address space of the process.

Modern systems allow processes to have large logical address spaces, so the page table itself can become very large. Hierarchical page tables can be used to break up the logical address space into multiple page tables.

See hash tables in ADS.

In address spaces > 32 bits, hashed page tables have become common. Here, the virtual page number is hashed into a page table.

Often, having one page table per process can be expensive in terms of physical memory. The alternative is the inverted page table, where there is one entry for each real page of memory and the entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns that page. This decreases memory usage, but increases the time needed to search the table when a page reference occurs. Hash tables can be used to limit the search to one, or at most a few, page-table entries.

Segmentation

Paging separates the user's view of memory from actual physical memory. A user thinks of the program as a collection of segments. A segment is a logical unit, such as: the main program, procedures, functions, methods, objects, local variables, global variables, common blocks, the stack, the symbol table and arrays. Segmentation is a memory management scheme that supports this user view of memory.

In segmentation, a logical address consists of a segment number and an offset, and a segment table maps these two-dimensional logical addresses into one-dimensional physical addresses. Each table entry has a base, which contains the starting physical address where the segment resides in memory, and a limit, which specifies the length of the segment.

With each entry in the segment table, a validation bit (which indicates whether or not the segment is a legal one) and read/write/execute privileges are associated. Since segments can vary in length, memory allocation is a dynamic storage-allocation problem, which can lead to external fragmentation.

Segmented Paging

Both paging and segmentation have advantages and disadvantages, and most successful processor families have moved towards a mixture of segmentation and paging. Segmented paging usually uses a segment table, a multi-level page table and an offset.

Virtual Memory

Virtual memory is the separation of user logical memory from physical memory. Only part of the program needs to be in memory for execution (i.e., only the part needed for current execution - this follows from the principle of locality, where program and data references inside a process tend to cluster and only a few pieces of a process will be needed in any period of time). The major advantages of virtual memory are that the logical address space can be much larger than the physical address space and that more processes can be partly in physical memory at any time.

Virtual memory can be much larger that physical memory, with n pages of virtual memory mapped to m pages of physical memory and other pages on secondary storage.

Virtual memory can be implemented assuming a virtual address space of 0..max, where max is usually a power of 2, e.g., 2^32 bytes, or 4 GiB, in 32-bit systems. The process can be loaded anywhere in physical memory and often the physical memory size is significantly smaller than the virtual address space of an individual process.

Physical memory nowadays is very large, so virtual memory is only useful for large data sets, and for compilation models.

Demand Paging

Demand paging brings a page into memory only when it is needed. This means there is less I/O, less memory, a faster response time and more multiprogramming.

A page is needed when there is a reference to it (i.e., an instruction/data address). An invalid address will lead to an abort, and a not-in-memory request will lead to a fetch to memory - this is called a page fault.

When the first reference to a page occurs, the page will not be in memory under this system, so a page fault occurs. This is a non-maskable interrupt (it can't be masked by the user or the OS) and it is synchronous, so the OS knows which running process caused the interrupt. This allows the OS to bring the appropriate page into main memory and to detect invalid memory references - the OS holds the range of valid addresses for a process. The process often does not need the full range of virtual memory, hence code size, data size and stack size can be used to check if a memory reference is valid. This information is stored on a per-process basis in the TCB.

When a page fault occurs, the OS looks at an internal table to decide if the process is permitted to make the memory reference; if it is, it finds an empty frame, swaps the page into that frame, updates the page table and the internal OS tables, and then restarts the instruction that caused the page fault - as with other standard interrupts.

Pure demand paging is where you start executing a program with no pages loaded into memory. This will trigger a page fault immediately, however.

Demand paging is more complex when a TLB is involved. When a virtual address is given, the processor consults the TLB, and if a TLB hit occurs, the real address is returned and main memory is consulted in the usual manner. If the page table entry is not found in the TLB (a miss), the page number is used to index the process page table, and then a check occurs to see if the page is in main memory. If it's not, a page fault is issued and the non-TLB process as above is followed, with the TLB being updated as well as the page table.

Page Replacement

Page replacement is needed when there are no free physical memory frames.

When we want to load a page into memory, we try to find a free frame to put the page into. If there is no free frame, then we use a page replacement algorithm to select a victim frame; the victim page is copied back onto disk, the new page is loaded into its place, and the page and frame tables are updated.

The aim when designing a page replacement algorithm is to give the lowest page-fault rate. An algorithm is evaluated by running it on a particular string of memory references and computing the number of page faults on that string. Ideally, the number of page faults falls steadily as the number of frames allocated increases.

Some algorithms, however, give more page faults if the number of frames allocated is increased - this is called Belady's Anomaly.

The simplest page replacement algorithm is the first-in-first-out method, which can be implemented using a circular linked list.

Another algorithm is the least recently used (LRU) algorithm, which replaces the page that has not been referenced for the longest time. By the principle of locality, this should be the page that is least likely to be referenced in the future. Each page could be tagged with the time of last reference, but this would require a great deal of overhead. LRU can be easily implemented using a stack of page numbers; when a page is referenced, it is moved or placed on the top of the stack, and when looking for something for a replacement, the bottom of the stack is all that you need.

The second chance algorithm uses a reference bit, which is initially set to 0 and then set to 1 when the page is accessed. The second chance algorithm treats the pages as a circular list. When searching for a page to replace, if the page being considered has a reference bit of 1, the bit is set to 0 and the page is left in memory, and the next page in circular order is considered subject to the same rules; a page with a reference bit of 0 is replaced.
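A sketch of the second-chance (clock) algorithm over a fixed set of frames follows; the frame count and structure layout are illustrative assumptions.

    #define NFRAMES 64

    struct frame {
        int page;             /* page currently held in this frame */
        int reference_bit;    /* set to 1 whenever the page is referenced */
    };

    static struct frame frames[NFRAMES];
    static int hand = 0;      /* current position of the clock hand */

    /* Return the index of the frame chosen as the victim. */
    int choose_victim(void)
    {
        for (;;) {
            if (frames[hand].reference_bit == 0) {
                int victim = hand;                 /* not recently used: replace it */
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            frames[hand].reference_bit = 0;        /* give the page a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }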

We can also consider counting algorithms, where a counter is kept of the number of references that have been made to each page. The least frequently used (LFU) algorithm replaces the page with the smallest count. The most frequently used (MFU) algorithm is based on the argument that the page with the smallest count was probably just brought in and has yet to be used.

The optimal algorithm replaces the page that will not be used for the longest period of time. It is impossible to have perfect knowledge of future events, but if you already know what's happening (as with a reference string), you can use this algorithm to measure how well your algorithm performs.

The OS may wish some frames to remain in physical memory frames, such as parts of the OS itself, control structures, I/O buffers, etc..., which is why we can use page and frame locking. When a frame is locked, it may not be replaced. This usually requires a lock bit to be added with each frame, and is usually supported in the OS, not hardware.

Each process requires a minimum number of frames. If indirect referencing is used, each level of indirection requires an extra page to be resident, so each level of references increases the number of frames required.

Frame Allocation

Again, there are different methods of allocating frames. If there are 100 frames and 5 processes, each process can be given 20 frames - this is called equal allocation. A different way is called proportional allocation, where allocation is given according to the size of the process: a(i), the allocation for process p(i), is (s(i)/S) × m, where s(i) = the size of process p(i), S = Σ s(i) and m = the total number of frames.
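For example (figures illustrative): with m = 62 free frames and two processes of size s(1) = 10 pages and s(2) = 127 pages, S = 137, so p(1) is allocated 10/137 × 62 ≈ 4 frames and p(2) is allocated 127/137 × 62 ≈ 57 frames.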

A variation of the proportional allocation scheme is the priority allocation scheme, where the proportions are computed based on priorities, rather than size (a higher priority process will execute quicker if it has more frames).

If a process pi develops a page fault you can either select a replacement from one of its frames or a replacement from a process with a lower priority number. This brings us to the question of global or local replacement. Global replacement will select a replacement frame from the set of all frames, so one process can take a frame from another. In priority based allocation schemes, this may allow higher priority processes to increase their allocation of frames by taking a frame from a low priority process. Problems arise as the page fault behaviour of one process will also depend on the behaviour of other processes.

In local replacement, each process selects from only its own set of allocated frames, so the process' page fault behaviour is entirely dependent upon itself, and not upon other processes. The process execution time becomes more consistent across multiple executions.

Thrashing

If a process does not have enough frames allocated, the page fault rate can get very high, so a lot of swapping back and forth between main memory and secondary storage is done, and the result is thrashing, which is where more work is done by the OS swapping and paging than is done by the application in doing something useful. This has severe performance implications.

A common form of thrashing arises from the OS monitoring CPU utilisation: many page faults occur, decreasing utilisation; as utilisation decreases, the OS increases the degree of multiprogramming to try to increase it, which in turn increases the page faults and decreases utilisation further...

Thrashing occurs under global frame replacement more as frames can be stolen from other (active) processes. Local frame replacement limits thrashing to within individual processes. If a process starts thrashing, it cannot steal frames from another process and cause it to start thrashing.

To prevent thrashing, a process must be allocated a sufficient number of frames. We can predict a sufficient number of frames using the locality model: the memory accesses of a program cluster into localities, which change over the execution of the program.

Thrashing will occur when the size of the locality > memory (pages) allocated to the process.

The working set model assumes locality - it defines a per-process working set window which contains the most recent page references for a process. If a page is in active use, it will have been accessed in the working set window, and will be in the working set. If a page is not being used, it will drop from the working set sometime after its last reference. The working set approximates the locality of a process.

Formally, we have Δ, the working-set window (a fixed width in time or references), and WSS(i), the working-set size of process P(i), which is the total number of pages accessed in the most recent Δ. The working set will vary with time; if Δ is too small, it will not cover the entire locality, and if Δ is too large, it will cover several localities. If Δ = ∞, it will cover the entire program. The most important property is the total size of the working sets in the system, D = Σ WSS(i). If D > m (the total number of frames), thrashing will occur, and a process must be suspended.

The working set can be approximated with an interval timer and reference bits. For example, with Δ = 10,000, each page has bits indicating whether it has been accessed in the last 5,000, 10,000 or 15,000 cycles. On a page access, the most recent bit is set, and a timer interrupt every 5,000 cycles shifts the bits along one, clearing the most recent bit. If any of the bits is 1, the page is in the working set.
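
A minimal sketch in C of this shift-register approximation is given below. The page_t structure and function names are hypothetical; in a real OS the reference bits would live in the page table, page_accessed() would be the hardware setting a reference bit, and the shifting would be done by the timer interrupt handler.

    /* Sketch of the reference-bit approximation of the working set.
       page_t, page_accessed(), timer_tick() are illustrative only. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES 8

    typedef struct {
        unsigned char ref_bits;   /* bit 2 = current interval, bits 1,0 = older */
    } page_t;

    static page_t pages[NPAGES];

    /* Called on every reference to a page: mark it used in the current interval. */
    void page_accessed(int p) { pages[p].ref_bits |= 0x4; }

    /* Called by the interval timer (e.g. every 5000 cycles):
       shift the history along one place, clearing the current-interval bit. */
    void timer_tick(void) {
        for (int p = 0; p < NPAGES; p++)
            pages[p].ref_bits >>= 1;
    }

    /* A page is in the (approximate) working set if any history bit is set. */
    bool in_working_set(int p) { return pages[p].ref_bits != 0; }

    int main(void) {
        page_accessed(3);
        timer_tick();          /* page 3 still in the working set (older bit set) */
        timer_tick();
        timer_tick();          /* after three intervals with no access it drops out */
        printf("page 3 in working set: %d\n", in_working_set(3));
        return 0;
    }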

This is not entirely accurate, as you cannot tell at what point within an interval of 5,000 cycles a page reference actually occurred. Accuracy can be improved by interrupting more frequently and increasing the number of reference bits, but this has a greater overhead and reduces the time available for useful work.

The working set model is quite clumsy and expensive to implement; a page-fault frequency scheme is more direct. An "acceptable" range of page-fault rates is defined: if a process's page-fault rate falls below the lower bound, the process loses a frame, and if its page-fault rate rises above the upper bound, the process gains a frame.

Inter Process Communication

We need processes to co-operate to carry out particular jobs, e.g., one process requests a service from another (such as requesting an I/O operation from the kernel, as the kernel is still a process), pipelined processes working together (one process can't proceed until the previous one has finished) and where multiple processes work on the same task, for example one might read from disk to memory, and another reads from memory to the audio device. This can be accomplished using inter process communication (IPC).

IPC has numerous advantages, such as:

  • Information sharing (e.g., of shared files)
  • Computation speed-up (e.g., a program can be split across multiple computers)
  • Modularity (for software engineering reasons)
  • Convenience (e.g., to edit, compile and print in parallel)

Independent processes cannot affect or be affected by the execution of another process (each process is in a separate, protected memory address space). Co-operating processes can affect or be affected by the execution of another process, via some form of IPC mechanism.

The OS must solve some of the problems that IPC causes, however. It must provide mechanisms to control interaction between processes and must order accesses to shared resources by processes. The OS must also provide mechanisms by which processes can wait until a resource becomes available.

A paradigm for co-operating processes is the producer-consumer problem: a producer process produces information that is consumed by a consumer process. Two models can be used here. The unbounded-buffer model places no practical limit on the size of the buffer, which simply expands as the producer fills it. The bounded-buffer model assumes a fixed buffer size, and the producer must wait for the consumer to remove data if the buffer is full.

See lecture notes
for full details

One way to implement the bounded-buffer solution is to use a circular array/list, but this only works if the processes can share memory. Another implementation that maximises the use of the buffer is to use an integer counter, which is incremented when the producer writes to the buffer and decremented when the consumer reads from it. However, these operations must be performed atomically, and this is not guaranteed: a context switch can occur part-way through an increment or decrement.
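
The sketch below (using pthreads; the variable and thread names are illustrative, not the notes' own code) shows why the unsynchronised counter is unsafe: each ++ and -- compiles to a separate load, modify and store, so updates can be lost when the threads interleave.

    /* Two threads update `counter` without synchronisation; compile with -pthread.
       The final value is unpredictable because the updates are not atomic. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 1000000

    static volatile long counter = 0;   /* shared between producer and consumer */

    static void *producer(void *arg) {
        (void)arg;
        for (long i = 0; i < ITERATIONS; i++)
            counter++;                  /* load, add 1, store -- not atomic */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (long i = 0; i < ITERATIONS; i++)
            counter--;                  /* load, subtract 1, store -- not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        printf("counter = %ld (expected 0)\n", counter);
        return 0;
    }

Run repeatedly, the final value is rarely 0, which is exactly the race described next.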

We now have a race condition, which is the situation where several processes access and manipulate shared data concurrently and the final value of the shared data depends on which process finishes last, so the data may become inconsistent. To prevent race conditions and to ensure consistent data, concurrent processes must be synchronised. This requires mechanisms to ensure the orderly execution of co-operating processes.

Critical Sections

Here, we can use critical sections. If we have n processes competing to use some shared data, each process has a code segment, called the critical section, in which the shared data is accessed. The problem is to ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section. Any solution to this problem must satisfy three requirements:

  • Mutual exclusion (mutexes) - if a process is executing in its critical section, then no other processes can execute in their critical sections
  • Progress - if no process is executing in its critical section and there exist some processes that wish to enter their critical sections, then the selection of the process that will enter its critical section next cannot be postponed indefinitely
  • Bounded waiting - a bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted

One way to accomplish this is to use commands such as EnterCriticalSection() and LeaveCriticalSection(). Processes may share some common variables to synchronise their actions. For example, we could have an integer variable called turn: if it matches the value associated with a process, that process can enter its critical section, and when it leaves it passes the turn on to the next process. This satisfies the mutual exclusion requirement, but not progress.
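
A compilable fragment of this turn-based approach for two processes (IDs 0 and 1; the names are illustrative) might look like this:

    /* Strict alternation between two processes sharing `turn`. */
    volatile int turn = 0;              /* whose turn it is to enter */

    void enter_critical_section(int me) {
        while (turn != me)
            ;                           /* busy-wait until it is our turn */
    }

    void leave_critical_section(int me) {
        turn = 1 - me;                  /* hand the turn to the other process */
    }

Mutual exclusion holds because only the process whose ID matches turn can enter, but progress fails: if it is not your turn, you cannot enter even when the other process has no interest in its critical section.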

A second algorithm uses guard flags, an array with one flag per process. When a process is ready to enter its critical section, it sets its flag to true and waits until no other process has its flag set; after completing its critical section, it sets its flag back to false. This again satisfies mutual exclusion, but not progress.

The final algorithm is a mixture of turn and flags: a process raises its flag to show interest, gives the turn away, and waits only while the other process is interested and holds the turn; it then enters its own critical section. This meets all three requirements, and solves the critical section problem for two processes.
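
A sketch of this combined flag-and-turn solution (commonly known as Peterson's algorithm) for two processes:

    #include <stdbool.h>

    volatile bool flag[2] = { false, false };  /* flag[i]: process i wants to enter */
    volatile int turn = 0;                     /* whose turn it is if both want in */

    void enter_critical_section(int me) {
        int other = 1 - me;
        flag[me] = true;                /* announce interest */
        turn = other;                   /* give priority to the other process */
        while (flag[other] && turn == other)
            ;                           /* wait only if the other is interested
                                           and it is its turn */
    }

    void leave_critical_section(int me) {
        flag[me] = false;               /* no longer interested */
    }

Note that on modern processors with relaxed memory ordering this only works if memory barriers are added; the fragment is shown to illustrate the logic, not as production code.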

However, we need to solve the problem for n processes, and one way of implementing this is to use the Bakery (or Lamport) algorithm. Before entering its critical section, a process receives a number, and the holder of the smallest number enters its critical section. If processes i and j receive the same number, then if i < j, then i is served first, else j is served first. The number scheme used always generates numbers in an increasing order of enumeration.

As in many aspects of software, hardware can make programming easier. Hardware exists that can solve the critical section problem. If an operation TestAndSet existed, which could test and modify the contents of a word atomically, mutual exclusion could be implemented. This does not satisfy the bounded wait requirement, however.
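
As a sketch of how such an instruction is used, the GCC/Clang builtin __sync_lock_test_and_set stands in below for a hardware TestAndSet; the lock variable and function names are illustrative.

    static volatile int lock = 0;       /* 0 = free, 1 = held */

    void acquire(void) {
        while (__sync_lock_test_and_set(&lock, 1))
            ;                           /* spin: the builtin returns the old value,
                                           so we loop while the lock was already held */
    }

    void release(void) {
        __sync_lock_release(&lock);     /* atomically set the lock back to 0 */
    }

As noted above, this still busy-waits and does not by itself provide bounded waiting.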

The software algorithms discussed above for exclusion of processes from critical sections are problematic as they rely on the continuous testing of variables (busy waiting), which is inefficient and wasteful of CPU cycles. The protocols are complex and unenforceable, relying on programmer discipline and co-operation. The OS implements efficient IPC mechanisms instead.

Signals

We've already discussed
signal handling in threads
in the threads section.

POSIX/Linux signals are a dataless communication either between processes, or between the OS and a process. Signals can be thought of as software interrupts and require a handler within a process. When the event that causes the signal occurs, the signal is said to be generated; when the appropriate action for the process is taken in response, the signal is said to be delivered; in the interim, the signal is said to be pending. There are a number of signal types, either system-defined (normally up to 31), each corresponding to a particular event, or user-defined (normally integers > 31). Some of these signals include:

  • SIGINT (2) - interrupt generated from a terminal
  • SIGILL (4) - interrupt generated by an illegal instruction
  • SIGFPE (8) - floating point exception
  • SIGKILL (9) - kill a process
  • SIGSEGV (11) - segmentation violation
  • SIGALRM (14) - alarm clock timeout
  • SIGCHLD (17) - sent to parent when child exits

Synchronous signals occur at the same place, every time in the program (e.g., SIGFPE, SIGILL, SIGSEGV) and are caused by the program itself. Asynchronous signals may occur at any time from a source other than the process (another process e.g., SIGKILL, a terminal driver on a key press e.g., SIGINT on Ctrl+C, on an alarm clock, e.g., SIGALRM) and a process can choose how to handle the specific signal.

A signal can be treated in one of three ways: it can be ignored (except for SIGKILL); the default action set by the OS can be executed (the default depends on the signal - terminate the process, suspend it, terminate it and save a memory image, continue a suspended process, or simply ignore it); or a specific signal-catching function can be invoked.

There are essential system calls for signalling:

  • kill(pid, signal) - send a signal of value signal to the process with ID pid.
  • sigaction(signal, handler, ...) - installs a handler for a specific signal (handlers are specific to each process; see the sketch after this list)
  • sigprocmask(..., mask, ...) - signals can be blocked or masked in much the same manner as hardware interrupts can be masked by the OS
  • sigpending(mask) - signals that are raised whilst they are masked out can be examined
  • sigsuspend(mask) - a process can wait until a signal arrives
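
A minimal sketch of these calls, installing a SIGINT handler and then sending the signal to ourselves (error checking is omitted for brevity):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    static void handler(int signum) {
        got_signal = signum;            /* only async-signal-safe work in a handler */
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);       /* block no extra signals inside the handler */
        sigaction(SIGINT, &sa, NULL);   /* install the handler for Ctrl+C */

        kill(getpid(), SIGINT);         /* send ourselves a SIGINT */
        while (!got_signal)
            pause();                    /* wait until a signal is delivered */
        printf("caught signal %d\n", (int)got_signal);
        return 0;
    }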

Semaphores

Semaphores are "a non-negative integer, which apart from initialisation can only be acted on by two standard, atomic, uninterruptible operations, WAIT (which decrements the value of the semaphore) and SIGNAL (which increments the value)" - E. Dijkstra

You must guarantee that a maximum of one process can execute a WAIT or SIGNAL operation on the same semaphore at the same time (atomicity).

There can be two types of semaphore, a counting semaphore (the integer value can range over an unrestricted domain) and a binary semaphore, where the binary value can range only between 0 and 1. This can be simpler to implement.

Semaphores can be used to implement mutual exclusion for the critical section of n processes. The shared data is a binary-semaphore mutex which is WAITed on when a program wants to enter the critical section and then SIGNALed when the critical section completes.

Semaphores can also be used to implement synchronisation. For example, statement B in process Pj may only be executed once statement A has executed in process Pi. When a semaphore is used to indicate a condition like this, it is called a condition variable, and synchronisation using a condition variable is called condition synchronisation.

This can introduce two problems, however: starvation (indefinite blocking), where a process may never be removed from the semaphore queue in which it is suspended; and deadlock, where two or more processes wait indefinitely for an event that can only be caused by one of the waiting processes.

Counting semaphores can be used for resource management. For a resource with m allocatable elements, the counting semaphore is initialised to m; a user acquires an element by waiting on the semaphore and frees one up by signalling the semaphore.

The classic problems of synchronisation, such as the bounded-buffer problem, the readers and writers problem and the dining-philosophers problem can be solved using semaphores and mutexs.

OS-level semaphores ensure mutual exclusion over a shared resource only if the application is disciplined: a process should access the shared data only while it holds the appropriate semaphore, and should issue a signal to release the semaphore when it finishes the critical section.

Semaphore operations still require atomicity. If the operation is implemented in the OS, a process accesses it using a system call; running inside the OS, the call executes atomically, or at least appears atomic from the user's point of view, even though interrupts may still occur. True atomicity can only be achieved with hardware support, such as the TestAndSet instruction discussed earlier.

POSIX semaphores support two interfaces (Linux does not support these fully). Unnamed semaphores are used within a process (or between related processes) and are created by a call to sem_init(...) by a process. Operations include sem_wait(...) for a wait and sem_post(...) for a signal. sem_trywait(...) also exists, which works like wait if the semaphore has a greater count than 0, but doesn't wait if the semaphore is 0, returning immediately to the calling process. These semaphores are destroyed by a call to sem_destroy(...), or if the creating process dies.
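
A sketch of an unnamed semaphore used as a mutex between two threads (compile with -pthread; as the notes say, support for these interfaces varies between systems):

    #include <semaphore.h>
    #include <pthread.h>
    #include <stdio.h>

    static sem_t mutex;                 /* binary semaphore protecting `shared` */
    static long shared = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            sem_wait(&mutex);           /* WAIT: decrement, block if already 0 */
            shared++;                   /* critical section */
            sem_post(&mutex);           /* SIGNAL: increment, wake a waiter */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        sem_init(&mutex, 0, 1);         /* 0 = shared between threads, initial value 1 */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %ld\n", shared);   /* always 200000 with the semaphore */
        sem_destroy(&mutex);
        return 0;
    }

Without the sem_wait/sem_post pair around the increment, this would be exactly the counter race described earlier.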

Named semaphores can be used between any processes. The namespace is the same as the filesystem (so the name of a semaphore is effectively a filename). These are created or opened using sem_open(...) and can then be manipulated as above. sem_close(...) removes the link between a process and a semaphore; named semaphores can live on past the termination of the creating process, so must be explicitly destroyed using sem_unlink(...). Here, the OS merely stores a list of processes waiting on a particular semaphore (i.e., processes that are blocked on it), but hardware support is still required for atomicity.

Linux implements SysV semaphores (i.e., semaphores as they existed in System V UNIX), which are not POSIX compliant. They are system-wide resources, managed by the OS, and any process can access a semaphore, which is stored in the OS's semaphore table. Allocation occurs via semget(...), which creates a set of semaphores (possibly containing just one) or associates with an existing set. All operations on semaphores occur using semop(semaphore, op, ...), and semctl(...) allows you to examine or change the value of a semaphore (e.g., to initialise it), as well as telling you how many processes are waiting on the semaphore and allowing you to destroy it.

Semaphores can also be used to provide mutual exclusion between threads, such as provided by the Linux pthreads library which provides mutexes. pthread_mutex_init() provides initialisation for a mutex, pthread_mutex_lock() obtains a lock for a mutex and pthread_mutex_unlock() will give up a lock. The semaphores between pthreads are supported by the standard Linux kernel semaphores.

Condition synchronisation can also be used between threads: a condition variable enables threads to atomically block and test a condition under the protection of mutual exclusion, until the condition is true. If the condition is false, the thread blocks on the condition variable and automatically releases the mutex. If another thread changes the condition, it may wake up waiting threads by signalling the associated condition variable; waiting threads then wake up, reacquire the mutex and re-evaluate the condition. pthread_cond_init() initialises a condition variable, which is used together with a mutex initialised by pthread_mutex_init(); pthread_cond_wait() takes a condition variable and a mutex as parameters, while pthread_cond_signal() takes just the condition variable.
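
A sketch of condition synchronisation with pthreads; the data_available flag is the condition being waited on, and the names are illustrative:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int data_available = 0;       /* the condition being waited on */

    static void *consumer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!data_available)          /* re-test: wakeups can be spurious */
            pthread_cond_wait(&ready, &lock);  /* atomically releases lock and blocks */
        printf("consumer: data is ready\n");
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *producer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        data_available = 1;              /* change the condition... */
        pthread_cond_signal(&ready);     /* ...and wake one waiting thread */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t c, p;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(c, NULL);
        pthread_join(p, NULL);
        return 0;
    }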

Using the low-level IPC mechanisms mentioned above relies on programmer discipline for the correct use of semaphores to wait and signal in the correct order at the correct point in the algorithm. Additionally, it requires an agreement between communicators (i.e., between the reader and the writer) on the precise protocol to be used, and problems can be encountered if the writer assumes writers have priority and readers assume that the readers have priority over the shared data. Sometimes, hardware support is required to provide atomicity (TestAndSet, interrupt disabling, etc).

High-level IPC provides a structured IPC less susceptible to programmer error. It incorporates both the means of shared data control and the shared data itself, as opposed to low level IPC, such as semaphores, where the shared data control is distinct from the shared data (note, the POSIX implementation assumes that the shared data in a file could be accessed from processes without waiting for the appropriate semaphore).

Message Passing

Message passing is a mechanism for processes to communicate and synchronise their actions. In a message system, processes communicate with each other without resorting to shared variables. If P and Q wish to communicate, they need to establish a communication link between them and exchange messages via send/receive.

This does require OS support, and a number of implementation issues must be resolved: where are messages stored whilst they are in transit; how many processes can join in a communication; is the message size fixed or variable; is the link uni- or bi-directional; is it synchronous or asynchronous; and so on.

With direct communication, processes must name each other explicitly: send(P, message) sends a message to process P, and receive(Q, message) receives a message from process Q.

With this method, links are established automatically, a link is associated with exactly one pair of communicating processes, and between each pair there exists exactly one link. The link may be unidirectional, but it is usually bidirectional.

A method of indirect communication uses messages directed to and received from mailboxes. Each mailbox has a unique ID, and processes can communicate only if they share a mailbox. This introduces the problem of deciding who gets a message if a mailbox is shared, however. The primitives in this method are send(A, message) and receive(A, message), which send and receive messages via mailbox A.

In this case, a link is only established if processes share a common mailbox, and a link may be associated with many processes. Each pair of processes may share several communication links and the links may be uni- or bi-directional.

Mailboxes can be owned by either the OS or a process. If a mailbox is OS-owned (in OS address space), it is independent and not attached to any process, so operations and system calls are required for a process to create a new mailbox, to send and receive messages through it, and to delete it. If a mailbox is process-owned (in the address space of a process), the owning process receives messages and other processes send messages to the mailbox. An operation or system call is still required to write to the mailbox.

Message passing can be considered to be either blocking or non-blocking. Blocking is considered synchronous: in a blocking send, the sending process is blocked until the message is received by the receiver or a mailbox, and in a blocking receive, the receiver is blocked until a message is available. Non-blocking is considered asynchronous: in a non-blocking send, the sending process sends the message and then resumes execution, and in a non-blocking receive, the receive call either gets a message or a NULL. The send and receive primitives may each be either blocking or non-blocking, depending on the implementation.

The link (direct communication) or mailbox (indirect communication) will contain messages that have been sent but not yet received. This is called a buffer, and buffers can be implemented in more than one way:

  • Zero capacity - 0 messages are held, and the sender must wait for the receiver
  • Bounded capacity - here there is a buffer with a finite length of n messages and the sender must wait if the link is full
  • Unbounded capacity - infinite length, and the sender never waits

Client-Server Communication

When the processes are on different machines, network communication is required. Here, three forms of IPC are available: sockets, remote procedure calls (RPCs) and remote method invocation (Java). These will be discussed in the NDS course.

In POSIX, message queues between processes are used. Messages are stored in FIFO order, and the queue can be set up so that the process sends or receives blocking (on a full or empty queue respectively), or sends and receives non-blocking. In blocking mode, a process reading from an empty queue will block.

In POSIX, a process creates a message queue or connects to an existing one using mq_open(name, ...), where name is a filename in the same namespace as POSIX semaphores. Sending and receiving messages occurs using mq_send(...) and mq_receive(...), and a process can close its connection to a message queue with mq_close(...); the message queue itself still exists after this call. To destroy a message queue, mq_unlink(...) is used; the queue is destroyed once it has been unlinked and all processes attached to it have closed their connections.
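
A sketch of these calls (the queue name /demo_queue is illustrative, error checking is omitted, and on some Linux systems you must link with -lrt):

    #include <mqueue.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void) {
        struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };
        mqd_t mq = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);

        mq_send(mq, "hello", 6, 0);              /* message priority 0 */

        char buf[64];                            /* must be >= mq_msgsize */
        unsigned prio;
        mq_receive(mq, buf, sizeof buf, &prio);  /* blocks if the queue is empty */
        printf("received: %s\n", buf);

        mq_close(mq);                            /* drop our connection */
        mq_unlink("/demo_queue");                /* destroy once all users have closed */
        return 0;
    }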

High-level Synchronisation Mechanisms

Semaphores provide a convenient and effective mechanism for process synchronisation, but their incorrect use can result in problems difficult to detect. To combat these problems, a number of high-level synchronisation constructs have been proposed: critical regions and monitors.

These constructs are not implemented at the OS level, but usually as part of the programming language; there is no atomicity at the application level, so the language implementation must build on lower-level mechanisms.

Critical regions are a high-level synchronisation construct. A shared variable v of type T is declared as v : shared T. The variable v is only accessible inside the statement region v when B do S, where B is a boolean expression, which means that when statement S is being executed, no other process can access variable v.

Regions referring to the same shared variable exclude each other in time (i.e., mutual exclusion is enforced). When a process tries to execute the region statement, the boolean expression B is evaluated. If B is true, S is executed, but if B is false, the process is delayed until B becomes true and no other process is in the region associated with v.

Critical regions are implemented using semaphores. With the shared variable x, the following are associated: semaphore mutex, first-delay, second-delay; int first-count, second-count;. Mutually exclusive access to the critical section is provided by mutex. If a process cannot enter the critical section because B is false, it initially waits on first-delay and is moved to second-delay before it is allowed to re-evaluate B. A count of the number of processes waiting on first-delay and second-delay is kept in first-count and second-count respectively.

This algorithm assumes a FIFO ordering in the queueing of processes for a semaphore. For an arbitrary queueing discipline, a more complicated implementation is required.

Monitors are high-level synchronisation constructs that allow the safe sharing of an abstract data type among concurrent processes. A monitor encapsulates shared data together with the procedures that act upon that data. Only the procedures can be accessed from outside the monitor; the data inside the monitor cannot be accessed directly from outside it. Only one process can be active inside a monitor at a time.

To allow a process to wait within a monitor, a condition variable must be declared as: condition x, y;. The condition variable can only be used with the operations wait and signal. x.wait() means that the process invoking the operation is suspended until another process invokes x.signal(), which resumes exactly one suspended process. If no process is suspended, then the signal operation has no effect.

A conditional-wait construct also exists: x.wait(c), where c is an integer expression evaluated when the wait operation is executed. The value is a priority number, stored with the name of the process that is suspended. When x.signal() is executed, the suspended process with the smallest associated priority number is resumed next.
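
Monitors are normally provided by a programming language rather than written by hand, but the sketch below shows the idea using pthreads primitives: a shared bounded buffer and its condition variables are hidden inside a structure, and the only access is through the "monitor procedures" deposit() and fetch(). All names here are illustrative.

    #include <pthread.h>

    #define N 8

    typedef struct {
        pthread_mutex_t lock;           /* only one process active in the monitor */
        pthread_cond_t  not_full;       /* condition variables: x.wait()/x.signal() */
        pthread_cond_t  not_empty;
        int buf[N], in, out, count;     /* the encapsulated shared data */
    } monitor_t;

    void monitor_init(monitor_t *m) {
        pthread_mutex_init(&m->lock, NULL);
        pthread_cond_init(&m->not_full, NULL);
        pthread_cond_init(&m->not_empty, NULL);
        m->in = m->out = m->count = 0;
    }

    void deposit(monitor_t *m, int item) {
        pthread_mutex_lock(&m->lock);
        while (m->count == N)
            pthread_cond_wait(&m->not_full, &m->lock);   /* wait(not_full) */
        m->buf[m->in] = item;
        m->in = (m->in + 1) % N;
        m->count++;
        pthread_cond_signal(&m->not_empty);              /* signal(not_empty) */
        pthread_mutex_unlock(&m->lock);
    }

    int fetch(monitor_t *m) {
        pthread_mutex_lock(&m->lock);
        while (m->count == 0)
            pthread_cond_wait(&m->not_empty, &m->lock);  /* wait(not_empty) */
        int item = m->buf[m->out];
        m->out = (m->out + 1) % N;
        m->count--;
        pthread_cond_signal(&m->not_full);               /* signal(not_full) */
        pthread_mutex_unlock(&m->lock);
        return item;
    }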

Two conditions need to be checked to establish the correctness of a system:

  • User processes must always make their calls on the monitor in a correct sequence.
  • You must ensure that an uncooperative process does not ignore the mutual-exclusion gateway provided by the monitor and try to access the shared resource directly, without using the access protocols.

Deadlocks

Deadlocks occur when a set of blocked processes are each holding a resource and waiting to acquire a resource held by another process in the set. Deadlocks can be characterised by the bridge-crossing problem:

There is only one lane across the bridge, and each section of the bridge can be viewed as a resource. If a deadlock occurs (two cars meet head-on), it can be resolved if one car backs up (pre-empt resources and roll back), but this may also require the cars following it to back up. Starvation can also occur if only cars travelling in one direction are ever allowed to pass.

Deadlock can arise if four conditions hold simultaneously in a situation:

  1. Mutual exclusion: only one process at a time can use the resource
  2. Hold and wait: a process holding at least one resource is waiting to acquire additional resources held by other processes
  3. No preemption: a resource can only be released voluntarily by the process holding it, after the process has completed its task
  4. Circular wait: there exists a set {P0, P1, ..., Pn} of waiting processes such that P0 is waiting for a resource that is held by P1, which is waiting for a resource that is held by P2, ..., Pn - 1 is waiting for a resource that is held by Pn, that is waiting for a resource that is held by P0.

We could plot a resource allocation graph, which consists of a set of vertices V and a set of edges E, such that V is partitioned into two types: P = {P1, P2, ..., Pn}, the set of all processes in the system, and R = {R1, R2, ..., Rm}, the set of all resource types in the system. A request is represented by a directed edge Pi → Rj (if the resource cannot be assigned immediately, the requesting process blocks), and an assignment is represented by a directed edge Rj → Pi (the resource is allocated to, and being used by, the process). When a resource is released, the corresponding edge is removed.

Consider, for example, a graph in which P3 has all the resources it needs, P2 holds R1 and is waiting for R3, and P1 is waiting for R1. There are no circularities, so there is no deadlock: when P3 frees R3, P2 can proceed, and when P2 releases R1, P1 can proceed.

A second graph does contain a deadlock, however: mutual exclusion holds over all of the resources, a number of processes hold resources while waiting for others, and no process will release a resource. Here P3 holds R3 and wants R2, P2 holds R1 and R2 and wants R3, and P1 holds R2 and wants R1. This is a circular wait, and we have all the conditions for deadlock.

The following basic rules can be applied: if the graph contains no cycles, there is no deadlock; if the graph contains a cycle and there is only one instance per resource type, then there is a deadlock; and if the graph contains a cycle but there are several instances per resource type, then there is the possibility of deadlock.

Handling Deadlocks

The three basic methods for handling deadlocks are prevention/avoidance (ensure that the system will never enter a deadlock state), detection and recovery (allow the system to enter a deadlock state and then recover from it) and the head-in-the-sand approach (ignore the problem and pretend that deadlocks never occur - the approach used by most OSs, including UNIX, Linux and Windows).

To implement prevention, we need to ensure that one or more of the deadlock conditions do not hold:

  • Mutual exclusion is not required for sharable resources, but it must hold for non-sharable resources.
  • Hold and wait - we must guarantee that whenever a process requests a resource, it does not hold any other resources: either the process requests and is allocated all of its resources before it begins execution, or it can only request resources when it currently holds none. This has the possibility of low resource utilisation and starvation.
  • No preemption - If a process that is holding some resources requests another resource that can not be immediately allocated to it, then all resources currently being held are released. Preempted resources are added to the list of resources for which the process is waiting and the process will only be restarted when it can regain its old resources, as well as the new ones it is requesting.
  • Circular wait - impose a total ordering of all resource types and require that each process requests resources in an increasing order of enumeration.

Prevention algorithms guarantee that deadlock will never occur, provided at least one of these restrictions is enforced. But there is a cost: restraining how resource requests are made has the potential for low device utilisation and reduced system throughput.

Alternatively, with some additional a priori information, deadlock can be avoided on a per-request basis. The simplest method is for each process to declare the maximum number of resources of each type that it may need; a more dynamic alternative is to examine the resource-allocation state on each request to ensure that there can never be a circular-wait condition. The resource-allocation state is defined by the number of available and allocated resources and the maximum demands of the processes.

System States

When a process requests an available resource, the system must decide if immediate allocation would leave the system in a safe state - a state where there exists a sequence of all processes so that all processes can get their required resources and complete. If the system is not in a safe state, it is in an unsafe state, which means the possibility of deadlock exists (but is not necessarily going to happen). The deadlock state is a subset of the unsafe states.

Avoidance requires that the system will never enter an unsafe state, so an algorithm must pick a safe sequence of events and processes for execution. One such algorithm is the resource-allocation graph algorithm. A claim edge Pi → Rj (drawn as a dashed line) indicates that process Pi may request resource Rj in the future; resources must be claimed a priori in the system. A claim edge is converted to a request edge when the process actually requests the resource, and to an assignment edge when the resource is allocated; when the resource is released, the assignment edge reverts to a claim edge. A request is only granted if converting it to an assignment edge does not create a cycle in the graph.
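
The avoidance idea can also be expressed as a safety check over the resource-allocation state (this is the safety algorithm used by the Banker's algorithm, which these notes do not cover in detail; the sizes and matrices below are illustrative): the check simulates granting each process its maximum remaining demand and only declares the state safe if some ordering lets every process finish.

    #include <stdbool.h>
    #include <stdio.h>

    #define NPROC 3
    #define NRES  2

    bool is_safe(int available[NRES],
                 int allocation[NPROC][NRES],
                 int max[NPROC][NRES]) {
        int work[NRES];
        bool finished[NPROC] = { false };
        for (int r = 0; r < NRES; r++) work[r] = available[r];

        for (int done = 0; done < NPROC; ) {
            bool progress = false;
            for (int p = 0; p < NPROC; p++) {
                if (finished[p]) continue;
                bool can_run = true;
                for (int r = 0; r < NRES; r++)
                    if (max[p][r] - allocation[p][r] > work[r]) can_run = false;
                if (can_run) {              /* p can get all it needs, run, then release */
                    for (int r = 0; r < NRES; r++) work[r] += allocation[p][r];
                    finished[p] = true;
                    progress = true;
                    done++;
                }
            }
            if (!progress) return false;    /* no process can proceed: unsafe */
        }
        return true;                        /* a safe sequence exists */
    }

    int main(void) {
        int available[NRES] = { 1, 1 };
        int allocation[NPROC][NRES] = { {1, 0}, {0, 1}, {1, 1} };
        int max[NPROC][NRES]        = { {2, 1}, {1, 2}, {2, 2} };
        printf("safe: %d\n", is_safe(available, allocation, max));
        return 0;
    }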

File Systems

Files are an integral part of most computer systems. Input to applications can be by means of a file, and output can be saved in a file for long-term storage. File systems can store large amounts of data, and the storage is persistent. The file management system is considered part of the OS.

The file management system is the means by which a process accesses files. The programmer does not need to develop file management software; this is provided by the OS. The objectives of a file management system are to meet the data management needs of the user, to guarantee that the data in a file is valid, to minimise the potential for lost or destroyed data and to optimise performance. The user requirements for data management can be broken down into:

  • being able to create, delete, read and change files
  • controlled access to other users' files
  • control the access allowed to the user's own files
  • restructure the users' files in a form appropriate to the problem
  • move data between files
  • backup and recover the users' files in case of damage
  • access and share the users' files using symbolic names

To enable this, the file management system typically provides functions to identify and locate a selected file, to organise files in a structure (often accomplished using directories) which describes the locations of all files and their attributes, and to provide mechanisms for protection. The file management system also provides secondary storage management; that is, it allocates files to free disk blocks and manages the free storage available for disk blocks.

Files

A file is a contiguous, logical address space (apart from sparse files); however, it may not be stored contiguously on the device. Files may be of different types, but from the perspective of the OS files have no structure - a file is just a sequence of machine words and bytes. From the user or application perspective, however, files may have a simple record structure made up of lines (of fixed or variable length), or complex structures which use special control characters to make up files of various types.

Files also have attributes assigned to them:

  • Name - in a human readable form
  • Type - for systems that support different types
  • Location - pointer to a file location on the device
  • Size - current file size
  • Protection - e.g., read, write, execute bits
  • Time, date and user identification - used for protection, security and usage monitoring

These file attributes are kept in the directory structure.

The OS only provides basic file operations on files, such as: create, write, read, seek, delete and truncate, so more complex operations must be made up of combinations of these operations, e.g., copy is a create, followed by a read from a source file and a write to the new file.

File types, traditionally implemented using file extensions (in Microsoft Windows, this is a legacy from DOS), enables the OS to invoke an appropriate application for each file type. UNIX and its variants use a "magic number" at the start of the file to declare its type.

File systems have criteria for file organisation, such as rapid access (especially needed when accessing a single data item), ease of update (not always a concern, e.g., CD-ROM), economy of storage (we want to reduce wasted space, although this is less of a concern in modern computing), simple maintenance and reliability.

Sequential Access Files

In these types of files, access starts at the beginning of the file, and to read to the middle of the file, you must start at the beginning and read the file until you reach the position you want, and then rewind to get back to the beginning. This method of file access is slow.

A fixed format is used for structuring data into records. Records are normally the same length, and the fields in a record are in fixed order. One field (usually the first) is a key, which uniquely identifies the record within that file. Usually, records are stored in key order and batch updates are used to reduce the update time (records are cached and then ordered before an update is done). This method is designed for sequential access devices (tapes), but sequential files can be implemented on random access devices, such as disks.

Additions to and deletions from the file can cause problems. The file is stored in a sequential ordering of record keys, so insertion requires "moving" some of the records on the storage medium. This is very expensive in time. Usually, new records are placed in a new file, called the overflow, log or transaction file and a periodic batch update occurs that merges the overflow file with the original file. Alternatively, records are organised as a linked list. More overhead in terms of storage is required, and additional processing is required also. This only really works (on sequential storage devices) if the records remain in key order. Deletion requests can also be stored in the log file.

To avoid slow sequential searching and to better support random access storage devices, indexed sequential files can be used. The index provides the lookup capability to get into the vicinity of a desired record faster: the index is searched for a key field that is equal to or less than the desired key value, and this entry contains a pointer into the main file; the search then continues in the main file from the location indicated by the pointer. The records are still stored in order, and new records are added to an overflow file (the record in the main file preceding the new record is updated to contain a pointer to the new record in the overflow file). The overflow file is then periodically merged with the main file during a batch update.

Comparison: If a file contains 1,000,000 records, in a standard sequential file, the average lookup case is 500,000 accesses to find a record in the sequential file. If an index contains 1000 entries, it will take on average 500 accesses to find the key, followed by 500 accesses in the main file. On average, it is now 1000 accesses.

Direct Access Files

Direct access files exploit the capability of random access devices (such as disks) to directly access any address on the device. No sequential ordering of data in the file is required, and this method can be used where rapid access is required, such as for swapping or conventional user files. It is not that useful for backups or archives where only occasional access is required (these can therefore be implemented on cheaper media, such as tapes).

Directories

Directories are used to organise files, and they themselves are files owned by the OS. Directories contain information about files, such as name, attributes, location, ownership, size, access/update times and protection information. They provide a mapping between file names and the files themselves.

From a user perspective, they need to search for a file (by name or attribute), create a file, delete a file, list a directory and rename a file, etc...

The directory must be organised to obtain efficiency (e.g., to locate a file quickly), naming (so it is convenient for users, two users can have the same name for different files and the same file can have several different names) and grouping (a logical grouping of files by properties).

There are multiple ways of implementing directories. One such way is the single-level directory, where a single directory exists for all users. If all users share the same directory, they must ensure that all file names are unique, and there is also a grouping problem: little structure or organisation is possible.

In a two-level directory setup, each user has their own directory. This allows different users to use the same filename and allows for efficient searching, but there is still no grouping capability.

Another is the tree-structured directory, which allows for efficient searching and grouping capability, though without any sharing of files. In this model we have the concept of a current (working) directory. There are issues with this model, however: path names can be absolute or relative, and creating a new file or directory is done in the current directory. Deleting a directory also raises a question - most OSs require the directory to be empty first, or require you to explicitly state that you want all the files within it removed.

The tree structure prohibits the sharing of files and directories, however; acyclic-graph directories remove this restriction. They are more complex than trees, and two different names can refer to the same file (aliasing), so there can be multiple path names for the same file. This introduces a deletion problem: a file cannot be deleted until all references to it are resolved. This is difficult, so it can be solved using backpointers (so that all the pointers to a file can be deleted), backpointers in a daisy-chain organisation, or an entry-hold-count solution.

Acyclic graphs are complex to manage, however: you need to ensure there are no cycles, which links can introduce (merely having subdirectories and files cannot introduce cycles). Ways of guaranteeing no cycles include only allowing links to files, not subdirectories, or running a cycle-detection algorithm every time a new link is added to determine whether it is okay (which is expensive).

A file has to be opened before it is of any use, and similarly, a file system must be mounted before any files or directories are available.

Sharing of files on multi-user systems is desirable. Sharing can be done through a protection scheme. A simple method is to ensure that only one user has write access to a file at a time. A complex method is using a variation of semaphores implemented by some file systems (including in POSIX) called file locks. On distributed systems, files may be shared across a network.

Protection

Protection exists to make sure that files are not read or modified by unauthorised persons. Protection mechanisms are provided by the OS to safeguard information inside the computer: objects, and the rights available on them in particular contexts, are identified with the aim of prohibiting processes from accessing objects they are not authorised to access.

A domain is associated with a set of <object, rights> pairs, where the rights are a subset of the operations that are allowed on the object; a domain could be a user or a process.

It is possible for an object to be in two domains at the same time with or without the same rights. Different mechanisms are used for identifying domains, in UNIX a combination of user ID and group ID is used.

Another method of providing protection is through access control lists and capabilities. The access matrix is an abstract model where rows correspond to domains and columns correspond to objects. An access control list (ACL) is a column - the access rights for one object across the different domains - and a capability list is a row, associated with each process, giving the set of operations that are permitted.

An access matrix might look like this (each row is a domain's capability list, and each column is an object's access control list):

            File 1      File 2   File 3      Card Reader   Printer
  neil      read                 read
  andy                                       read          print
  DomainX               read     execute
  DomainY   read/write           read/write

The capability list is a row of the matrix. A capability is like an abstract data type: it points to an object and specifies which operations are allowed on it (it may also have a 'type' field indicating what type of object it is). The list then specifies which capabilities are accessible in this domain.

In UNIX, the file owner (by default, the creator) is able to control what should be done, and by whom. Types of accesses include read, write and execute. In UNIX, this is implemented as a bitmask with three domains, owner, group and other. Groups in this case are created by a sysadmin, not the user.
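
A sketch of inspecting and changing these permission bits from a program, using stat() and chmod() (the filename is illustrative):

    #include <sys/stat.h>
    #include <stdio.h>

    int main(void) {
        struct stat st;
        if (stat("notes.txt", &st) == 0)
            /* mask off the 9 permission bits: rwx for owner, group and other */
            printf("mode = %03o\n", st.st_mode & 0777);

        /* owner: read+write, group: read, other: no access */
        chmod("notes.txt", S_IRUSR | S_IWUSR | S_IRGRP);
        return 0;
    }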

File systems reside on secondary storage and contain both programs and data; secondary storage also holds swapped-out pages as part of the virtual memory management system. The main issues to be addressed here are how to store files (a contiguous logical structure) on secondary storage (disk) and how to map high-level file operations onto low-level disk operations, whilst making efficient use of the disk and keeping read/write times fast.

Disks provide a series of blocks, usually of a fixed size, which is often related to the memory frame/page size (typically a whole number of file blocks fits into one physical memory frame).

Disks are random access, that is, you can directly address any given block on the disk, and it is simple to access any file sequentially or randomly. For I/O efficiency, transfers between memory and disk are often performed in units of blocks.

For convenient access to disks, the OS imposes a logical file system, which defines how files look to the user (i.e., operations, file attributes, directories, etc...). Algorithms and data structures are needed to map the logical file structure onto the physical disk. This file organisation maps logical file structure onto physical block addresses. A basic file system issues generic commands to a device driver to read and write blocks. Finally, the I/O control contains device drivers to translate basic file system commands into commands for a specific disk.

Several structures are used to implement file systems, such as on disk structures including the boot control block, the partition control block, the directory structure to organise the files and a per file file control block (FCB) containing permissions, dates (access, create, modify), owner, group, size and blocks. In memory structures are also used, such as the partition table, the directory structure (with recently accessed directories) and an open file table (both system wide, and per process).

These on-disk and in-memory structures are used together by the OS when opening a file and when reading from it.

Virtual File Systems

Virtual file systems exist so that a single interface can be provided by the OS to many different types of file systems. Different file systems can be mounted on different mount points within the same directory structure, and operations have the same effect on all the different file systems.

This logical file system is called the Virtual File System, or VFS and it allows the same system call interface (the API) to be used for different types of file system. The API is to the VFS interface, rather than any specific type of file system, and the VFS can distinguish between local and remote files.

Directory Implementation

The simplest implementation of directories is a linear list of file names with pointers to the data blocks. This is time-consuming, however, as linear searches are expensive (for example, a search is required when a file is created to ensure the name is unique); this cost can be offset by keeping frequently used directory information in a cache. Keeping the names in sorted order could reduce search time, but would make inserting and deleting names more costly.

Another implementation is to use a hash table, which decreases directory search time. There is a problem with collisions (situations where two file names hash to the same location) in this approach, however: a method is needed for recording that a hash location has several corresponding file names, and for determining which one to use. This type of hash table is called a chained-overflow hash table.

File Allocation Methods

In contiguous allocation, each file occupies a set of contiguous blocks on the disk. This is relatively simple to implement, as only a starting location (block #) and length (number of blocks) are required. Random access is also relatively easy. This is wasteful of space (dynamic storage-allocation problem) due to fragmentation (see memory management earlier) and files can not easily grow.

A slightly improved method is linked allocation, where each file is a linked list of disk blocks which may be scattered anywhere on the disk, each block containing a pointer to the next. Again, this is simple, as only the starting address is required, and no space is wasted by free-space management. However, there is no random access: you have to walk down the linked list to find the appropriate block.

The file allocation table (FAT) system is a variant of linked allocation used in MS-DOS and OS/2. A section at the beginning of the disk is dedicated to the FAT, containing an entry for each disk block. The directory entry then only needs the block number for the start of the file; the FAT, indexed by block number, gives the number of the next block in the file, and so on.
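
A sketch of following a FAT chain, with the table held in an ordinary in-memory array for illustration (a real FAT lives in reserved disk blocks, and the end-of-file marker value is file-system specific):

    #include <stdio.h>

    #define FAT_EOF (-1)

    int main(void) {
        /* fat[b] = number of the block after b, or FAT_EOF at the end of the file */
        int fat[16] = {0};
        fat[4] = 9; fat[9] = 2; fat[2] = FAT_EOF;   /* a file occupying blocks 4 -> 9 -> 2 */

        int start = 4;                              /* from the directory entry */
        for (int b = start; b != FAT_EOF; b = fat[b])
            printf("read disk block %d\n", b);      /* a real system would issue a block read */
        return 0;
    }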

Linked allocation solves the problems of contiguous allocation, but it does not support random access unless a FAT is used. Random access can also be achieved with indexed allocation: all the pointers are brought together into an index block, one per file, in which the ith entry points to the ith block of the file. With large files, however, large index tables are needed, which is a disadvantage (multi-level indexing can be introduced to combat this). Additionally, the index table is an overhead.

Free Space Management

Since disk space is limited, space from deleted files needs to be reused and free space needs to be tracked. One way of implementing this is a bit vector which records whether each block is free or occupied. An alternative to the bit vector is a linked list: a pointer is kept to the first free block, which contains a pointer to the next free block, and so on. Traversing the list requires multiple disk block reads, which is inefficient, but traversal is an uncommon operation - most operations need just one free block. The linked list does not require any extra space, unlike the bit vector, although there is no easy way to obtain contiguous space.
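
A sketch of the bit-vector approach, with one bit per disk block (the sizes and names are illustrative):

    #include <stdio.h>

    #define NBLOCKS 64

    static unsigned char bitmap[NBLOCKS / 8];       /* 1 bit per disk block; 1 = free */

    int find_free(void) {
        for (int i = 0; i < NBLOCKS; i++)
            if (bitmap[i / 8] & (1 << (i % 8)))
                return i;                           /* first block marked free */
        return -1;                                  /* disk full */
    }

    void set_free(int i, int is_free) {
        if (is_free) bitmap[i / 8] |=  (1 << (i % 8));
        else         bitmap[i / 8] &= ~(1 << (i % 8));
    }

    int main(void) {
        for (int i = 10; i < 20; i++) set_free(i, 1);   /* blocks 10-19 are free */
        int b = find_free();
        set_free(b, 0);                                 /* allocate it */
        printf("allocated block %d\n", b);
        return 0;
    }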

These methods can be modified using grouping, where the addresses of the next n free blocks are stored in the first free block, the next n addresses in the following free block, and so on (this is more efficient in the number of blocks that need to be read), and counting, which takes advantage of the fact that free blocks often cluster together: the first free block of a contiguous run stores the number of free blocks immediately following it, together with the address of the next free block that starts another contiguous run.

POSIX and Linux File Systems

In these implementations, the logical file system provides simple file operations, including: create, open, read, write, close, mount (mounting a file system at a mount point), stat (returns information about a file), lseek (repositions the current position in the file) and truncate (causes file to be truncated to a specific size).
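
A sketch using some of these calls (open, fstat, lseek, read and close; the filename is illustrative and error handling is minimal):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <stdio.h>

    int main(void) {
        int fd = open("notes.txt", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        fstat(fd, &st);                         /* file attributes: size, mode, ... */
        printf("size = %lld bytes\n", (long long)st.st_size);

        char buf[16];
        lseek(fd, 10, SEEK_SET);                /* reposition to byte 10 */
        ssize_t n = read(fd, buf, sizeof buf);  /* read up to 16 bytes from there */
        printf("read %zd bytes\n", n);

        close(fd);
        return 0;
    }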

This assumes a structure where all files have distinct absolute paths, using a tree hierarchy. It is possible to set up cycles via symbolic links.

The VFS provides uniform access to all file systems, including special files (e.g., /proc, /dev).

The file system organisation is a multi-level indexing scheme, with an i-node (index-node) associated with each file, which contains attributes and disk addresses of the files' blocks. There is no distinction between files and directories other than having a different mode type field.

In an i-node, the first few disk addresses are stored directly (which, for small files, will be all of the file's disk addresses), along with pointers to indirect blocks that contain further disk addresses. This may go to further levels of indirection, giving a very flexible system at the expense of complexity and storage space.

For directories, each entry contains a file name and an i-node number. Here, the i-node contains information about the time, size, type, ownership and disk blocks contained in the i-node. If you imagine that the file system knows the root directory and wants to look up /usr/neil/ops/exam, then from the root i-node, the system locates the i-node number of the file /usr, and then looks for the next component neil, and so on. Relative paths work in the same way, where there is a known current directory i-node.

The Ext2 Linux general-purpose filesystem implements this, with a mostly static allocation of i-nodes. The disk is split up into block groups of one or more disk cylinders (i.e., blocks that are physically close together), and blocks within a group are used for i-nodes and file data. Bitmaps keep track of free data blocks and free i-nodes in each group. When allocating disk blocks, the file system tries to keep related information together (e.g., the i-nodes of all the files in a directory in the same block group), and new directories are created in the block groups with the smallest number of directories.

This scheme works well as long as there is > 10% free space.

With journalling file systems, file data lives both on disk and in main memory (a memory cache of disk blocks is used for efficiency), so we have to consider what happens when the system fails. On disk, file system data structures may be corrupted, and we must ensure there is no loss of data (i.e., that consistency is maintained). Journalling (or log-based transaction) file systems attempt to maintain this consistency.

For a file operation, updates to the disk structures are first written to a log, and once they are all logged the changes are considered committed. The updates for a file operation form a transaction; the actual updates to the disk structures are then performed, so that after a crash it is known how much of each transaction has been carried out (and which transactions remain in the log). ReiserFS under Linux is an example of a journalling file system.

I/O

Handling I/O efficiently is important to the user, and therefore to the OS. All user interaction with the machine is done using I/O and most user invoked operations (i.e., applications) will use I/O at some point.

Operating systems provide I/O device handling based on low-level hardware interfaces to devices (i.e., accessing a device via special instructions). Device drivers are provided to handle the idiosyncrasies of individual devices: they are programmed in terms of the low-level hardware interface and present a standard high-level application interface to devices, enabling their easy use by application software.

See ICS for information
on computer organisation

Devices are addressed either using direct mapping, where devices exist in a special I/O range and special I/O instructions exist in the CPU to implement this, or by memory mapping, where devices appear as addresses in the normal memory range and normal processor instructions are used to access these addresses.

Devices tend to be accessible in the same place for a particular architecture, and each address range maps to the I/O control registers for the device, accessed by special I/O instructions. Some devices have both direct and memory-mapped address space (e.g., graphics cards).

I/O ports typically consist of a status register, containing bits read by the host that indicate device state (e.g., data ready for a read); a control register, which enables the host to change the mode of a device (e.g., from full- to half-duplex for serial ports); a data-in register, which is read by the host to get input; and a data-out register, which is written to by the host with data to be sent to the device. This introduces the issue of how the host knows when a particular condition has arisen in the status register.

Polling is one way to solve this problem, and is a host-initiated approach: the device status register is polled until the wanted condition occurs, but this is inefficient as there is a busy-wait cycle waiting for I/O from the device. The alternative solution is to use interrupts, involving both the host and the device: a CPU interrupt request line is triggered by the device, and the interrupt handler receives the interrupt and uses an interrupt vector to dispatch it to the correct handler based on priority. Most interrupts are maskable, so some can be ignored or delayed, and the interrupt mechanism is also used for exceptions. There is no need for busy-waiting in this model.
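
The busy-wait loop at the heart of polling looks like the sketch below. Real device registers can only be touched from a driver, so here they are simulated with ordinary variables; the register layout and bit names are entirely illustrative.

    #include <stdint.h>
    #include <stdio.h>

    #define STATUS_READY 0x01u                  /* "data ready" bit in the status register */

    static volatile uint32_t status_reg  = 0;   /* stands in for the device status register */
    static volatile uint32_t data_in_reg = 0;   /* stands in for the data-in register */

    uint32_t read_with_polling(void) {
        while ((status_reg & STATUS_READY) == 0)
            ;                                   /* busy-wait: burns CPU until data arrives */
        status_reg &= ~STATUS_READY;            /* host acknowledges the data */
        return data_in_reg;
    }

    int main(void) {
        data_in_reg = 42;                       /* the "device" produces a value... */
        status_reg |= STATUS_READY;             /* ...and raises the ready bit */
        printf("read %u\n", read_with_polling());
        return 0;
    }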

In the IA32 architecture, interrupts 0-31 are unmaskable, as some are exceptions, and 32-255 are maskable and are used for devices.

For information on
DMA, see ICS

DMA is used to avoid programmed I/O for large data movement and is implemented using a DMA controller to bypass the CPU and transfer data directly between the I/O device and memory.

Application Interfaces

There are many different hardware devices, but application processes want a common interface. Devices can vary in many dimensions:

  • transfer style: character devices deal with bytes or words of data (e.g., keyboards, mice, serial ports; commands include get and put, and libraries are often provided to allow in-line editing), stream devices deal with a stream of bits, and block devices deal with blocks of data (e.g., disks; commands include read, write and seek, and most raw I/O filesystem access uses this)
  • sequential or random access
  • sharable or dedicated
  • device speed
  • read-write, read-only or write-only

Timer devices also exist, which provide the current time, elapsed time, or just a standard timer.

I/O can be either blocking or non-blocking. Under a blocking scheme, a process is suspended until its I/O is completed; this is easy to use and understand, but is insufficient for some needs (although blocking can be worked around using threads). Under a non-blocking scheme, an I/O call returns as much data as is available and returns quickly, with a count of the bytes read or written. A final scheme is asynchronous I/O, which is more difficult to use: the program runs while the I/O executes, and the I/O subsystem signals the process when the I/O is completed.

Usually, hardware devices are controlled in the same manner as files (create, write, read, delete, etc...), so the filesystem namespace can be used for devices (/dev in UNIX). The kernel implements I/O through a series of layers, from the system call interface through the kernel I/O subsystem down to the device drivers and the hardware itself.

The I/O subsystem of the kernel implements I/O scheduling, ordering I/O requests via a per-device queue (some OSs also try to implement fairness). Additionally, buffering is implemented, where data is stored in memory while it is transferred between devices; this copes with device speed mismatches, transfer size mismatches and "copy semantics".

Buffering can be implemented using different methods, such as caching, where fast memory holds a copy of the data (always just a copy - this is the key to performance), or spooling, where output for a device is held (this is useful if the device can only serve one request at a time, e.g., printing). Device reservation can also be used, where exclusive access is provided to a device, with system calls for allocation and deallocation; however, deadlock (discussed above) can be encountered here.

The kernel keeps data structures for state information about I/O components, including open file tables, network connections and character device state, as well as many other data structures to track buffers, memory allocation and "dirty" blocks. Some OSs use object-oriented methods and message passing to implement I/O.

Some OSs use a file based approach to kernel I/O structures. Devices are named as files and accesses to devices are by file operations (with some exceptions, e.g., seek in a character device may have no effect). In Linux, an additional command ioctl() provides direct access to the device driver, perhaps to set the device into a particular mode (e.g., changing speed or other parameters of a serial port).

I/O Performance

I/O is a major factor in system performance, as it makes demands on the CPU to execute the device driver or kernel I/O code and it forces context switches due to interrupts. Data copying and network traffic are especially stressful.

Ways of improving performance include reducing the number of context switches, data copying and interrupts by using large transfers, smart controllers and polling. DMA can also be used and the CPU, memory, bus and I/O performance need to be balanced for highest throughput.

This brings us to the question of where functionality should be implemented to get the best performance - at the application, kernel or hardware level?