ACM Multimedia 97 - Electronic Proceedings

November 8-14, 1997

Crowne Plaza Hotel, Seattle, USA


Move-To-Rear List Scheduling: A New Scheduling Algorithm for Providing QoS Guarantees

John Bruno

Eran Gabber

Banu Özden

Abraham Silberschatz

Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974
USA
Email: {jbruno, eran, ozden, avi}@research.bell-labs.com




Abstract

In order to support multiple real-time applications on a single platform, the operating system must provide Quality of Service (QoS) guarantees so that the system resources can be provisioned among applications to achieve desired levels of predictable performance. The traditional QoS parameters include fairness, delay, and throughput. In this paper we introduce a new QoS criterion called cumulative service. The cumulative service criterion relates the total service obtained by a process under a scheduling policy to the ideal service that the process would have accumulated by executing on each resource at a reserved rate. We say that a scheduling policy provides a cumulative service guarantee if the performance of the real system differs from the ideal system by at most a constant amount. A cumulative service guarantee is vital for applications (e.g., a continuous media file service) that require multiple resources and demand predictable aggregated throughput over all these resources. Existing scheduling algorithms that guarantee traditional QoS parameters do not provide cumulative service guarantees. We present a new scheduling algorithm called Move-To-Rear List Scheduling (MTR-LS) which provides a cumulative service guarantee as well as the traditional guarantees such as fairness (proportional sharing) and bounded delay. The complexity of MTR-LS is O(ln(n)), where n is the number of processes.


Keywords

Quality of service, cumulative service, resource reservation, scheduling algorithm




Introduction

 

New multimedia applications, which require support for real-time processing, are pacing the demand for operating system support for Quality of Service (QoS) guarantees. The desire to support multiple real-time applications on a single platform requires that the operating system have the ability to provision the system resources among applications in a manner that achieves the desired levels of predictable performance. Moreover, computer networks are starting to provide QoS guarantees with respect to packet delay and connection bandwidth. These QoS guarantees are of little use if they cannot be extended to the endpoint applications via operating system support for related QoS parameters. Current general-purpose multiprogrammed operating systems do not provide QoS guarantees since the performance of a single application is, in part, determined by the overall load on the system. As a result, many users prefer to use stand-alone systems with limited dependency on shared servers to achieve some semblance of QoS by indirectly controlling the system workload. Real-time operating systems are capable of delivering performance guarantees such as delay bounds, but require that applications be modified to take advantage of the real-time features. Our goal is to provide QoS guarantees in the context of a general-purpose multiprogrammed OS, without modification to the applications, by giving the user the option to provision system resources among applications in order to achieve the desired performance levels.

In this paper we introduce a new QoS parameter, which we call the Cumulative Service. Guaranteeing cumulative service is vital for applications that require multiple resources and demand a predictable aggregate throughput over these resources. We present a new scheduling algorithm, called Move-To-Rear List Scheduling (MTR-LS), for allocating operating system resources, such as CPUs, disks, and network interfaces, among competing processes. We show that MTR-LS provides strict guarantees regarding the cumulative service obtained by a process. MTR-LS also provides guarantees for more traditional QoS parameters such as fairness and delay.

Our results are applicable to a general model of processes. However, for ease of exposition we define a process to be a sequence of phases (Section 3) where each phase consists of a requirement from a particular system server. It is assumed that neither the phase requirements nor the exact sequence of phases are known ahead of time. Our model allows for new processes to dynamically enter the system and for processes to depart the system. Processes specify their QoS requirements by providing a ``service fraction'' for each system resource. Admission control consists of ensuring that the service fractions of new processes along with the service fractions of processes already in the system do not exceed certain prescribed limits. Admission control is necessary if we are to provide delay and cumulative service guarantees that are independent of the number of processes in the system.

Another feature of our work is that it deals with system services that have limited preemptability. For example, disk I/O occurs in multiples of a basic block size, and once a transfer has begun, the next scheduling event cannot occur before the transfer is complete. However, the granularity of CPU scheduling is taken to be arbitrary.

Our cumulative service QoS parameter is new. The cumulative service criterion compares the total service accumulation of a process at a server to an ideal service accumulation based on the service fraction of the process for this server (we use resources and servers interchangeably). We say that a scheduling policy guarantees cumulative service if the total service obtained over any time interval does not fall behind the ideal service accumulation, based on the service fraction, by more than a constant amount. In our ideal system we assume that a process receives its specified service fraction of each of the resources without interference from other processes. This corresponds to a processor sharing model in which the servers are not overbooked, i.e., the sum of the service fractions at each server is not greater than one. Recall that we model processes as a sequence of phases where each phase corresponds to a service requirement from a particular server. In spite of the fact that the delay incurred by a phase of a process at one server is propagated to subsequent phases of the process at the same server and other servers, we will show that the MTR-LS policy provides a cumulative service guarantee over every interval of time.

The QoS parameters supported by many existing scheduling algorithms include proportional sharing of the CPU among competing processes and delay bound guarantees [8, 11, 5, 10]. However, they do not support a cumulative service guarantee since neither fairness nor delay bounds are sufficient to provide a cumulative service guarantee. Informally, guaranteed cumulative service means that the scheduling delays encountered by a process on various resources do not accumulate over the lifetime of the process.

The cumulative service criterion is essential for providing aggregate service guarantees to applications that require the sequential use of multiple resources, such as the CPU and disk. It is especially vital for applications that act as services. A good example is a continuous media (CM) file service that shares system resources with other applications or services and is itself in a position to provide certain QoS guarantees to other applications. A typical implementation of a streaming CM file service assumes that it can sustain a certain amount of disk I/O throughput. However, it also needs the CPU for tasks such as initiating disk requests and managing its buffers.

Example 1: To make our point, let us consider the following overly simplified CM server: 1) An entire disk drive with an effective bandwidth of 40 Mbps and half of the CPU cycles are reserved for this service. 2) In between every 1 MB of disk I/O, the CM server requires 1 ms of real CPU time to dispatch disk requests and manage the buffers. Since it reserved half of the CPU, it expects that the execution time for each CPU phase will take at most 2 ms. Similarly, since it reserved the entire disk, it expects that each disk phase (i.e., 1 MB of disk I/O) will take 200 ms. Thus, the CM server expects a disk I/O throughput of 1 MB/202 ms = 39 Mbps. The CPU delay bound provided by existing algorithms [11, 5, 10] depends upon, among other things, the granularity of the scheduling quanta. The granularity cannot be too small, for otherwise the context-switching overhead would become excessive. It would not be unusual to have a CPU delay bound of 75 ms. That is, the CPU scheduler promises that each CPU phase of length t will complete within t/0.5 + 75 ms. The t/0.5 results from reserving half of the CPU, and the 75 ms is due to an average delay of 75 ms for other processes to get their time-slice. Suppose that each CPU phase of the CM server waits 75 ms for the other processes contending for the CPU before it executes for 1 ms on the CPU. A 76 ms delay for 1 ms of CPU execution is still within the delay bound promised by the CPU scheduler. However, the disk I/O throughput of the CM server will degrade to 1 MB/276 ms = 28 Mbps! This is 30% less than expected. Under these circumstances, the CM server could not support 26 streaming MPEG-1 sessions, each requiring a 1.5 Mbps throughput (26 x 1.5 Mbps = 39 Mbps).
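The arithmetic of this example can be reproduced with a few lines. The sketch below is purely illustrative: the disk bandwidth, CPU reservation, phase sizes, and 75 ms delay bound are the assumptions of Example 1, and the computed worst-case figure comes out at roughly 28-29 Mbps depending on rounding.

# Worked check of Example 1 (all figures are the example's assumptions).
DISK_BW_MBPS = 40.0      # effective disk bandwidth reserved by the CM server
CPU_FRACTION = 0.5       # CPU service fraction reserved by the CM server
CPU_PHASE_MS = 1.0       # real CPU time needed between disk transfers
DELAY_BOUND_MS = 75.0    # per-phase delay bound promised by the CPU scheduler
IO_PHASE_MBIT = 8.0      # 1 MB of disk I/O expressed in megabits

disk_phase_ms = IO_PHASE_MBIT / DISK_BW_MBPS * 1000.0  # 200 ms per 1 MB
ideal_cpu_ms = CPU_PHASE_MS / CPU_FRACTION             # 2 ms if the reservation is honored promptly
worst_cpu_ms = CPU_PHASE_MS + DELAY_BOUND_MS           # 76 ms still satisfies the delay bound

ideal_tput = IO_PHASE_MBIT / ((disk_phase_ms + ideal_cpu_ms) / 1000.0)
worst_tput = IO_PHASE_MBIT / ((disk_phase_ms + worst_cpu_ms) / 1000.0)
print(round(ideal_tput, 1), round(worst_tput, 1))      # ~39.6 Mbps vs ~29 Mbps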

The main contributions of this work are the introduction of the cumulative service criterion as an important QoS parameter for OS scheduling and the MTR-LS policy which provides a cumulative service guarantee and guarantees for other QoS parameters including fairness (proportional sharing) and delay bounds.

The remainder of the paper is organized as follows. QoS parameters and theoretical developments in terms of deterministic performance bounds are well developed in the link scheduling literature. In the next section we review some of the relevant aspects of link scheduling, since they are closely related (both in terms of similarities and differences) to their counterparts in operating systems. Following this, we discuss the QoS parameters that play a role in our work. In Section 3 we introduce our system and process models and definitions of QoS parameters. In Section 4 we present a new scheduling policy, called Move-To-Rear List Scheduling, that is fair and provides a cumulative service guarantee. Finally, Section 5 summarizes our work.



Background and Related Work

 


Link Scheduling

The QoS issue has received much attention in the packet scheduling literature [2, 13, 4, 12, 6], where packets are identified with ``flows'' and there is a conceptual queue per flow or collection of flows. The correspondence between a sequence of packets and a flow is application-dependent, but we assume that we can identify the packets belonging to a particular flow.

The objective for the link scheduler is to determine the order in which packets are scheduled for transmission on the output link so as to achieve various performance measures. We review some of the QoS parameters for output link scheduling and contrast these with QoS support for processes in the operating system context where there are important similarities and differences.

We assume that when a packet arrives at a node a routing table lookup is performed to determine the appropriate output link and a (possibly concurrent) flow identification lookup is done to determine the appropriate queue. Although there is a delay associated with these steps, they tend to be uniform over the packets. We take the arrival of a packet at a queue as an instantaneous event corresponding to the placement of a record, which gives the specifics of the packet data, in the appropriate queue data structure.

There are roughly two kinds of packet sources: open-loop and closed-loop. For an open-loop packet source, the times-of-arrival of packets are not affected by the delay experienced by packets that have already arrived. A closed-loop packet source is one for which the times-of-arrival may be affected by the delays experienced by packets that have already arrived. Open-loop sources include unacknowledged transmissions of UDP packet streams. Closed-loop sources include TCP/IP flows.

The QoS parameters in link scheduling include delay, delay jitter, throughput, and fairness. Packet delay can be measured in a number of ways. The delay can be measured from the time that the packet arrives at its queue until it is transmitted over the link. This measure of delay depends on the behavior of the traffic source and the scheduling policy. Another possibility is to measure the delay of the packet from the time it reaches the head of its queue until it is transmitted over the output link. This measure of delay does not depend on the characteristics of the traffic source. Yet another possibility is to assume that each flow is provided a certain bandwidth reservation and calculate the departure time of each packet based on the assumption that all the packets in the flow are serviced at the reserved bandwidth rate. The delay of a packet is the difference between the packet's actual departure time and its calculated departure time (may be negative). This delay measures the real system's ability to match the performance of an idealized system.
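The last measure of delay can be made concrete. In the sketch below (ours, not from the paper), each packet's ideal departure time is computed by servicing the flow's backlog at exactly its reserved rate; the delay of a packet is then its actual departure time minus this ideal departure time.

def ideal_departures(arrivals, reserved_rate_bps):
    """arrivals: list of (arrival_time_s, size_bits) for one flow.
    Returns departure times assuming the backlog is drained at the reserved rate."""
    finish = 0.0
    ideal = []
    for t, size in arrivals:
        start = max(finish, t)                 # service cannot start before arrival
        finish = start + size / reserved_rate_bps
        ideal.append(finish)
    return ideal

# delay of packet k = actual_departure[k] - ideal[k]; it may be negative
# when the scheduler runs ahead of the reserved rate.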

Delay jitter is a measure of the variability of packet delay. The minimization of delay jitter is important in situations where the receiver buffers packets from a flow and plays them out at a constant rate. The larger the delay jitter, the larger the buffer needed to regulate the flow for smooth playout.

Throughput is a measure of the rate at which the data from a flow is transmitted over the link. Determination of the throughput of a flow depends on the time scale used to measure the transmission rate. For example, the instantaneous transmission rate of data in a flow has two possible values, 0 and the link transmission rate (i.e., either packets from the flow are being transmitted on the link or not). The measurement of throughput can be tricky since we are interested in an average of the instantaneous transmission rate over some interval of time. A flow is said to be backlogged whenever the corresponding queue contains packets to be transmitted on the link. The evaluation of the average transmission rate for a flow is usually taken over a period of time during which the flow is continuously backlogged and is given as the number of bits transmitted over an arbitrary interval during the busy period. Again, if this period is too small, then we could see large fluctuations in the transmission rate. Over longer periods of time, including several periods during which a flow is backlogged, the throughput for a ``stable'' flow will match the average arrival rate of the flow (the definition of stability). Otherwise, we are in a situation in which the queue for a particular flow can grow without bound and buffer overflow becomes the primary concern. To ensure stability, and also to meaningfully discuss packet delay (the interval of time from when the packet enters the system until it is transmitted), one has to make assumptions regarding the nature of the packet sources, such as their average and peak bandwidths. Often the assumptions regarding source characteristics are enforced (shaped) by an appropriate regulator, such as the so-called leaky-bucket regulator, which effectively constrains the average bandwidth and the ``burstiness'' of the traffic.
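For concreteness, a leaky-bucket (token-bucket) regulator of the kind mentioned above can be sketched in a few lines. This is a generic illustration, not a mechanism from the paper; the rate and bucket depth are whatever the traffic contract specifies.

class TokenBucket:
    """Releases packets only when enough tokens have accrued, bounding the
    flow's average rate (token rate) and burstiness (bucket depth)."""
    def __init__(self, rate_bps, depth_bits):
        self.rate, self.depth = rate_bps, depth_bits
        self.tokens, self.last = depth_bits, 0.0

    def release_time(self, arrival_s, size_bits):
        # Accrue tokens since the last event, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (arrival_s - self.last) * self.rate)
        self.last = arrival_s
        if size_bits <= self.tokens:           # conforming packet: pass through
            self.tokens -= size_bits
            return arrival_s
        wait = (size_bits - self.tokens) / self.rate
        self.tokens = 0.0                      # all available tokens consumed
        self.last = arrival_s + wait
        return arrival_s + wait                # held until enough tokens accumulate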

Fairness is a measure of how close the server comes to Generalized Processor Sharing (GPS) [9] when serving simultaneously backlogged flows. GPS is an idealized model in which the capacity of the link is assumed to be infinitely divisible and can be shared among any number of flows simultaneously as long as the capacity of the link is not exceeded. The primary purpose of the fairness criterion is to isolate the behavior of one flow from another.

With any QoS support, it is necessary to have a way of prescribing the level of support that is required. Much of the research in link scheduling assumes that the QoS requirement for each flow is specified by a single number having the interpretation of a (dimensionless) weight or a rate. Suppose flow i has an associated weight given by a positive real number φ_i, and C denotes the capacity of the output link in bits per second (bps). Then the usual interpretation of the φ_i's is that, ideally, whenever flow i is backlogged it will transmit packets at an average rate which is no less than (φ_i / Σ_j φ_j) C bps, where the sum is taken over the backlogged flows. The determination of QoS parameters, such as delay, delay jitter, throughput, and fairness, does not directly follow from the weight of a single backlogged flow but depends on the total weight of the backlogged flows and on the specific algorithm used by the link scheduler. This indirect way of specifying QoS requirements makes admission control difficult. Also, using weights to guarantee delay bounds results in underutilization.

The delay and delay jitter requirements can be somewhat decoupled from the throughput requirements by using regulators (shapers) which release packets to the link scheduler at the appropriate times. Such regulator-scheduler combinations can result in policies that may not transmit packets even when packets are available for transmission. Delay-earliest due date [3] provides delay bounds independent of the bandwidth guaranteed to a flow, however, at the cost of reserving bandwidth at the peak rate.



Scheduling in Operating Systems

We next turn to QoS parameters in operating systems (OS). Unlike link scheduling, in the operating system context there are multiple resources, such as the CPU, disks, and network interfaces, that are shared among competing processes. Each of the servers (resources) is capable of delivering ``work'' at a certain rate. For example, a CPU executes instructions at 100 Million Instructions Per Second (MIPS), a disk can transfer a data block (512 bytes) in 12 milliseconds, and a network interface can deliver bits on its output link at 10 Million bits per second (Mbps).

Processes in the OS context roughly correspond to flows and are modeled as a sequence of phases, where each phase consists of the name of a server and the corresponding amount of ``work.'' For example, (CPU, 100 million instructions) is a phase that specifies the CPU as the server and the work, which consists of executing 100 million instructions. The amount of time taken by the phase depends on the rating of the CPU server: a 100 MIPS CPU will take 1 second to complete this phase. At the completion of a phase, the process moves to the next phase, which consists of a new server and work requirement. In the actual system, the phases (server and work requirements) are not known in advance.
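The process model just described maps directly onto a small data structure. The sketch below is ours, for illustration only; the server ratings are the example figures quoted above.

from dataclasses import dataclass

@dataclass
class Phase:
    server: str     # e.g. "CPU", "DISK", "NET"
    work: float     # work in the server's units (instructions, bytes, bits)

# Example server ratings from the text: work units completed per second.
RATES = {"CPU": 100e6, "DISK": 512 / 12e-3, "NET": 10e6}

def phase_time(phase: Phase) -> float:
    """Time for the named server to complete the phase running alone."""
    return phase.work / RATES[phase.server]

# A process is just a sequence of phases, revealed one at a time:
process = [Phase("CPU", 100e6), Phase("DISK", 4096), Phase("CPU", 50e6)]
print(phase_time(process[0]))    # 1.0 second on a 100 MIPS CPU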

Since we model processes as a sequence of phases, any delay incurred while completing a phase will be propagated to all subsequent phases. Thus we think of a process as a ``closed-loop'' source. Conversely, if a process gains a time advantage by getting extra service, this advantage is also passed on to subsequent phases. One could also consider ``open-loop'' processes in which the time-of-arrival of a phase is independent of previous phases. For example, if the phases of a process were generated by interrupts, it is possible for a new phase to arrive before the previous phase completes its service. We do not consider ``open-loop'' processes in this paper.

In the link scheduling environment the scheduler transmits one packet-at-a-time over the link. We assume in the OS context that each server has a preemption interval which specifies the temporal boundaries where preemptions may occur. This means that the ``granularity'' of sharing is determined by the properties of the server and the scheduler.

In order to specify QoS requirements, we associate with each process and each server a reservation, called a service fraction, which gives the amount of the server required by the process. For example, suppose a process has a 0.25 reservation on the CPU. In the case of a 100 MIPS CPU, this means that the process needs at least a 25 MIPS CPU to meet its performance objectives. The weights φ_i defined in the section on link scheduling determine service fractions as the ratios φ_i / Σ_j φ_j. We choose to use service fractions since they reflect, in absolute terms, the service requirement of the process and thereby simplify admission control.

The performance objective most readily specified by reservations is a cumulative service guarantee, which means a guarantee that the real system will keep pace with an ideal execution based on the server reservations. For example, suppose a process reserves 20% of the CPU and 50% of the disk I/O subsystem, and suppose the CPU is rated at 100 MIPS and the disk I/O subsystem can perform a single-block (4 KB) transfer in 12 milliseconds. According to the reservation, this process should ``see'' at least a 20 MIPS CPU and a disk I/O subsystem capable of transferring a single block in 24 milliseconds. Suppose the process alternates between CPU and disk I/O phases, where each CPU phase requires the execution of 4 million instructions and each disk I/O phase consists of 6 random block transfers. Accordingly, the process should take no more than 200 ms for each CPU phase and 144 ms for each disk I/O phase, regardless of the number of process phases and competing processes.
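A quick check of these numbers, using only the reservations and ratings stated above:

cpu_rate, cpu_frac = 100e6, 0.20      # 100 MIPS CPU, 20% reserved -> "sees" 20 MIPS
blk_ms, disk_frac = 12.0, 0.50        # 12 ms per 4 KB block, 50% reserved -> 24 ms per block

cpu_phase_ms = 4e6 / (cpu_rate * cpu_frac) * 1000.0   # 4 million instructions -> 200 ms
disk_phase_ms = 6 * blk_ms / disk_frac                # 6 block transfers -> 144 ms
print(cpu_phase_ms, disk_phase_ms)                    # 200.0 144.0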

Other QoS parameters for processes can be defined. If we associate the phases of a process with the packets of a flow, we obtain the following notion of delay. The delay of a phase at a particular server is the cumulative time spent by the phase either waiting for the server or running on the corresponding server. It is not difficult to see that guaranteeing delay bounds (i.e., bounding the time it takes to complete a phase) is not sufficient to provide a cumulative service guarantee. This is because the phase delays can accumulate (in the closed-loop case) over multiple phases, leading to an unbounded discrepancy between the actual and the ideal cumulative service. For example, phase delays on one server may reduce the service rate obtained at other servers in a closed-loop system.
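A toy calculation makes the accumulation argument explicit: if a closed-loop process is delayed by the full per-phase bound D on every phase, then after n phases it trails the ideal schedule by n times D, which is unbounded in n. The numbers below are arbitrary.

D = 75.0             # per-phase delay bound (ms), assumed for illustration
ideal_phase = 100.0  # ideal phase duration at the reserved rate (ms)

actual = ideal = 0.0
for n in range(1, 6):
    ideal += ideal_phase
    actual += ideal_phase + D     # every phase finishes as late as the bound allows
    print(n, actual - ideal)      # gap = n * D: 75, 150, 225, ...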

We also define the notion of fairness which measures the ability of the system to ensure that processes simultaneously contending for the same server will ``share'' that server in proportion to their reservations. Fairness in the OS context, sometimes referred to as proportional sharing [11, 5, 10], is problematic since the cost of providing fairness (context switching) increases as the granularity of server sharing decreases. It is also not clear that fine-grain sharing is always desirable in a general-purpose operating system, particularly for ``batch'' processes where coarse-grain sharing is acceptable and substantially reduces the context switching overhead.

Recently proposed scheduling algorithms which are most closely related to our work are: Stride scheduling [11], which attempts to provide each process with a share of the server in proportion to its corresponding weight (number of ``tickets''); start-time fair queuing [5], which is based on the corresponding link scheduling algorithm [6]; earliest eligible virtual deadline first [10], which provides each process with a share of the server in proportion to its corresponding weight; and the CPU scheduling policy presented in [8], which provides each process with its reserved share. These algorithms were not designed with our cumulative service measure in mind, so it is not surprising that the properties they do enjoy are not sufficient to provide a cumulative service guarantee.

In the next section we introduce our system and process models and give formal definitions of the terms used in the remainder of this paper.



Servers, Processes, and QoS Parameters

 

A system consists of a collection, S, of servers (e.g., CPU, disk, and network). Each server s ∈ S is characterized by a service rate B_s and a preemption interval Δ_s. If w is an amount of work to be accomplished by server s, then the time to complete w on server s is w/B_s. When a ``process'' is run on a server with a positive preemption interval Δ_s > 0, the running time must be an integral multiple of Δ_s and the process can only be preempted at integral multiples of Δ_s. The limiting case Δ_s = 0 corresponds to a server for which running times are arbitrary and preemptions are not restricted.

A phase is a server-duration pair, (s, t), where s ∈ S and t is the amount of time it would take server s to complete the phase running alone on the server. An equivalent definition of a phase (s, t) is a server-work pair, (s, w), where w = t B_s.

A process P is a sequence (finite or infinite) of phases, (s_1, t_1), (s_2, t_2), (s_3, t_3), .... The phases of a process are not known in advance. Initially, the only thing we know about a process is the identity of the first server, i.e., s_1. By running P on server s_1 we will eventually learn t_1, the duration of the first phase, and the server required for the second phase, i.e., s_2. By running P on server s_2 we eventually discover t_2 and the server required for the third phase, viz., s_3, and so on.

Let a_1 ≤ a_2 ≤ ... denote the sequence of times at which processes P_1, P_2, ... enter the system. The departure time of a process depends upon the process and the scheduling discipline. We assume that each process has a (possibly infinite) departure time. A process P_i is active at time t if a_i ≤ t and the departure time of P_i is greater than t. Let A(t) denote the set of indices of the active processes at time t.

We assume that each process P_i, before being admitted to the system, specifies a service fraction for each server s, namely, f_i^s, with 0 ≤ f_i^s ≤ 1. We require for all t and all s ∈ S that Σ_{i ∈ A(t)} f_i^s ≤ 1. That is, the sum of the service fractions of all active processes with respect to server s does not exceed 1.

Even though we are interested in the performance of our system over all servers, it is sufficient to study the performance at a single server. From the point of view of server s, a process is denoted by a sequence of phases that alternate between server s and elsewhere, i.e., (s, t_1), (e, t_2), (s, t_3), ... or (e, t_1), (s, t_2), (e, t_3), .... The ``elsewhere'' server, e, represents the phases of processes at servers other than s.

A process arrives at server s if it enters the system at server s or it requires server s after completing a phase at the elsewhere server. The process arrival times at server s depend upon the duration of the phases and the scheduling policies used at the various servers. A process leaves server s whenever it completes a phase at server s. When a process leaves server s it will either eventually depart the system or arrive at server s for another phase.

Since we are considering the performance at a single server, say s, we can drop references to server s since it is understood. Therefore B denotes the service rate of the server, Δ denotes the preemption interval of the server, and f_i denotes the service fraction of process P_i.

When a process is run on a server it is assigned a maximum running time, which we call a quantum. The scheduling algorithm is not required to use a fixed-size quantum. Scheduling decisions are made at points in time called decision epochs, which correspond to the expiration of the current quantum, the completion of the phase of the current running process, or the end of the current preemption interval following the arrival of a process at the server. In the latter case, if the arrival of a process occurs at time t while the server is in the midst of a preemption interval (t_0, t_0 + Δ), then the scheduler must wait until the end of that interval, that is, the decision epoch occurs at t_0 + Δ. At each decision epoch, the current process can be preempted and the scheduler can assign a new process to the server.

Realizable scheduling policies require that we run at most one process at a time on the server. This means that if there is more than one process waiting to run on the server, then one or both of the processes will experience (queuing) delay. Although we do not show the dependency of the following quantities on the scheduling policy, it is important to keep this dependency in mind. Let (τ, t], τ < t, be an arbitrary real time interval. We define w_i(τ, t) and s_i(τ, t) to be the cumulative real waiting time (blocked by other processes running on the server) and real service time (running on the server), respectively, obtained by process P_i in the interval (τ, t]. For t ≤ τ we define w_i(τ, t) = 0 and s_i(τ, t) = 0.

Let r_i(τ, t) = 0 for t ≤ τ, and for τ < t define r_i(τ, t) = w_i(τ, t) + s_i(τ, t). By definition, r_i(τ, t) is the total time spent by process P_i at the server in the interval (τ, t].

The quantities just defined are illustrated in Figure 1. The heavy line denotes the accumulation of waiting time and service time of a process at the server. The slope of this curve is either 1 or 0 depending on whether the process is at the server or elsewhere, respectively. From the figure we see that the process arrives at the server at time 4 and leaves at time 17 for elsewhere. The process arrives again at time 23. The lower, lighter curve, labeled 1, represents the service obtained by the process (the curve labeled 1/2 is explained below). This curve has slope 1 or 0 depending on whether the process is being served or waiting, respectively. Since the process leaves at time 17, it means that the service requirement of the phase is 6. Intervals on the time axis are labeled with e, w, or s, depending on whether the process is elsewhere, waiting, or being served, respectively. We can read off the following quantities from the figure: r(4,17) = 13; w(4,17) = 7; s(4,17) = 6; r(12,23) = 5; w(12,23) = 2; s(12,23) = 3.
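These quantities can be computed mechanically from a per-process timeline. The segment boundaries below are one consistent reading of Figure 1 (the exact split of waiting and service between times 4 and 17 is inferred); the sketch reproduces the values quoted above.

# (start, end, state) with state in {"e": elsewhere, "w": waiting, "s": served}.
SEGMENTS = [(0, 4, "e"), (4, 9, "w"), (9, 12, "s"), (12, 14, "w"),
            (14, 17, "s"), (17, 23, "e"), (23, 26, "w")]

def accum(tau, t, states):
    """Total time spent in the given states during the interval (tau, t]."""
    return sum(max(0.0, min(end, t) - max(start, tau))
               for start, end, state in SEGMENTS if state in states)

w = lambda tau, t: accum(tau, t, {"w"})
s = lambda tau, t: accum(tau, t, {"s"})
r = lambda tau, t: accum(tau, t, {"w", "s"})

print(r(4, 17), w(4, 17), s(4, 17))      # 13.0 7.0 6.0
print(r(12, 23), w(12, 23), s(12, 23))   # 5.0 2.0 3.0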

To evaluate the performance of our scheduling algorithm, we introduce a processor sharing model in which the server can run any number of processes simultaneously as long as the sum of their service fractions does not exceed one. In the processor sharing model, processes do not block one another since they can run simultaneously on the server, albeit at a reduced rate. We refer to the service time in the processor sharing model as virtual service time.

A process with service fraction f that receives t units of real service time would take t/f virtual service time units to obtain the same amount of service under processor sharing running at rate f B. Conversely, a phase taking v units of virtual service time to complete under the processor sharing model requires f v units of real service time on the real server.

Let v_i(τ, t) denote the cumulative virtual service time obtained by process P_i in the interval (τ, t]. For t ≤ τ we define v_i(τ, t) = 0. Note that v_i(τ, t) = s_i(τ, t)/f_i, and therefore v_i(τ, t) ≥ s_i(τ, t) for f_i ≤ 1.

 
Figure 1:   Example of Cumulative Service.

In Figure 1 the curve labeled 1/2 denotes the virtual service time accrued by a process with a service fraction equal to 1/2. Accordingly, the slope of this curve is either 2 or 0 depending on whether the process is being served or not, respectively. Therefore the phase of duration 6 (from time 4 to 17) requires 12 units of virtual service time to complete.

Definition 1: We say that a scheduling policy provides a cumulative service guarantee if there exists a constant K such that for all processes P_i and all intervals (τ, t], we have r_i(τ, t) ≤ v_i(τ, t) + K.

 

From Figure 1 we have: v(4,9) = 0, r(4,9) = 5; v(9,14) = 6, r(9,14) = 5; and v(4,23) = 12, r(4,23) = 13. Considering the portion of the process illustrated in Figure 1, we conclude that r(τ, t) exceeds v(τ, t) by at most a small constant for all τ and t shown.

Another interpretation of the cumulative service guarantee is that the total real time taken to provide a process with service (including the waiting time and the service time) is no more than a constant amount of time more than the virtual service time required for an equivalent amount of real service.

Although the definition of cumulative service guarantee is in terms of a single server, it implies a ``global'' cumulative service guarantee (using cumulative virtual service time and cumulative real time over all servers) in the multi-server case where there is a constant number of servers.

Example 2: Consider a process P = (s_1, t_1), (s_2, t_2), (s_1, t_3), (s_2, t_4), ... that alternates between servers s_1 and s_2 and reserves fractions f_1 and f_2 of s_1 and s_2, respectively. Also, let K_1 and K_2 be the cumulative service bounds on servers s_1 and s_2, respectively. In a system that guarantees cumulative service, the total real time for the servers to provide [ Σ_{i=1}^{n/2} t_{2i-1} + Σ_{i=1}^{n/2} t_{2i} ] time units of service (waiting plus service time) is bounded by [ (1/f_1) Σ_{i=1}^{n/2} t_{2i-1} + (1/f_2) Σ_{i=1}^{n/2} t_{2i} + K_1 + K_2. ] This is in contrast to other scheduling policies (e.g., [8, 11, 5, 10]) that provide delay bounds on a per-phase basis. In that case the discrepancy between the cumulative service obtained and the time to acquire this service can grow with n, the number of phases.

Definition 2: We say that a scheduling policy provides a delay bound if, for any process P_i, the real waiting time plus service time to complete a phase of duration d is at most a constant amount more than d/f_i.

A ``fair'' scheduling policy ensures that multiple processes requiring the same server share the server in proportion to their reservations, independent of their previous usage of the resource [2]. That is, a fair scheduling policy does not penalize a process that utilized an idle server beyond its reservation when other processes become active on that server.

Definition 3: A scheduling policy is fair if there exists a constant D such that for any time interval (τ, t] during which a pair of processes, P_i and P_j, both continuously require the server, we have |s_i(τ, t)/f_i - s_j(τ, t)/f_j| ≤ D.

Processor sharing provides ideal fairness [9]. However, processor sharing cannot be implemented and thus a host of scheduling policies that aim to provide bounded fairness and/or delay properties have been devised for link scheduling [2, 13, 4, 5]. As we mentioned earlier, the cost of providing fine-grain fairness (proportional sharing [11, 5, 10]) is high and not always justified in the OS context. However, fairness is important when services are overloaded and it is necessary for all impacted processes to make steady, and proportional, progress.



Move-To-Rear List Scheduling

 

In this section we present a new scheduling policy, called Move-To-Rear List Scheduling (MTR-LS), which provides a cumulative service guarantee, is fair, and has bounded delay. In the following subsection we present the MTR-LS policy, followed by a subsection which contains formal statements and proofs of the properties of the MTR-LS policy.



The Move-To-Rear List Scheduling Policy

Central to the MTR-LS policy is an ordered list, L, of the processes that are active at any time. We say a process on the list is runnable if it is not elsewhere. The MTR-LS policy services the runnable processes in the order in which they appear on the list L.

The MTR-LS policy makes use of a constant T, which we call the virtual time quantum. Associated with each process P_i on the list is a value left_i. The initial value of left_i is f_i T. When processes are serviced, they run for a quantum which is bounded by the value in left_i. At the end of the service period, left_i is decremented by the actual amount of service time that the process obtained, and if the result is zero, then P_i is moved to the rear of the list and the value of left_i is reset to f_i T. The value f_i T is the real-time quantum. A process that advances by f_i T real service time advances by T virtual service time.

The service obtained by a process can be less than the allocated quantum due to the termination of a phase or the arrival of a process. In the former case, the phase terminates, the process goes elsewhere, and the first runnable process on L is serviced next. In the latter case, if the arriving process is ahead of the currently running process in the list L, then the running process is preempted (as soon as the preemption interval permits) and the first runnable process on L is serviced next.

The description of the MTR-LS policy depends on a few mechanisms, which we give next. Whenever a new process, P_i, enters the system it is added to the end of L and left_i is set equal to f_i T, where T is a system constant. As long as the process is in the system, whether it is at the server or elsewhere, it appears in the list L. Whenever a process departs the system it is removed from L.

Whenever all the processes in L are elsewhere, the server is idle. Otherwise the server is running a process and the state is busy. Decision epochs correspond to the expiration of the current quantum, the completion of the phase of the current running process, or the end of the current preemption interval following the arrival of a process at the server. In the latter case, if the arrival of a process occurs at time t while the server is in the midst of a preemption interval (t_0, t_0 + Δ), then the scheduler must wait until the end of that interval, that is, the decision epoch occurs at t_0 + Δ.

The command wait causes the scheduler to ``sleep'' until the next decision epoch. Whenever a process starts running, a timer, called elapsed, is started from zero. elapsed can be used to determine the service obtained by the currently running process.

Run_a_Process
if there is no runnable process on the list L then
    state = idle;
else
    Let P_i be the first runnable process on the list L;
    state = busy;
    run P_i on the server for at most left_i time units
      (current quantum) and start elapsed timer;
wait;
Figure 2: Routine Run_a_Process.  

The routine Run_a_Process, shown in Figure 2, is called to select the next process to run on the server. Run_a_Process looks for the first runnable process on the list L. If the list does not contain a runnable process, then the server state is set to idle and the scheduler waits for the next decision epoch. Otherwise the first runnable process P_i on L is selected and run for a quantum of at most left_i time units. The server state is set to busy and the scheduler waits for the next decision epoch. The variable elapsed will record the elapsed time to the next decision epoch.

The Move-To-Rear List Scheduling policy is shown in Figure 3. The MTR-LS policy is called at each decision epoch. It determines if a process was running in the interval leading up to this decision epoch by checking to see if the state is busy. If so, it decrements the corresponding left_i by the elapsed time since the previous decision epoch. If the resulting value of left_i is zero, the corresponding process is moved to the end of the list L and left_i is reset to f_i T.

Under the MTR-LS policy there are two ways for a runnable process to be blocked. First, it can be blocked by runnable processes ahead of it on the list L. Second, for servers with a positive preemption interval (Δ > 0), a runnable process can be blocked by processes that are behind it on the list L. This happens when a process arrives at the server while another process is running and in the midst of a preemption interval. If the arriving process is ahead of the running process in the list L, then the arriving process will be blocked at least until the end of the current preemption interval. This kind of blocking is called Δ-blocking. It is important to notice that if a process is Δ-blocked then, because of its position in the list L, it will obtain service before the process that caused the Δ-blocking returns to service.

  Initialization
For each process P_i, which is active at time 0, put P_i on the list L (in any order) and set left_i = f_i T;
Run_a_Process;

The Method (Runs at each decision epoch)
Decision epochs correspond to the expiration of the current quantum, the completion of the phase of the current running process, or the end of the current preemption interval following the arrival of a process at the server

if state == busy then
    Let P_i be the current running process;
    left_i = left_i - elapsed;
    if left_i == 0 then
        Move P_i to the rear of the list L;
        left_i = f_i T;
Run_a_Process;

Figure 3: Move-To-Rear List Scheduling.  

A straightforward implementation of MTR-LS stores the runnable processes in L in a heap [1]. When a process is moved to the rear of L it is given a new largest timestamp. Arriving processes and runnable processes that are moved to the rear of the list are inserted into the heap in O(ln(n)) time, where n is the number of runnable processes in L. The runnable process with the smallest timestamp (corresponding to the first runnable process in the list L) can be found in constant time. It takes O(ln(n)) time to rebuild the heap when the first process is removed from the heap because it is no longer runnable (i.e., it has gone elsewhere).
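A sketch of this heap-based implementation, under the assumptions above (positions in L are encoded as ever-increasing timestamps, and stale heap entries are discarded lazily); the class and method names are ours, not the paper's:

import heapq

class MTRList:
    """Heap-backed version of the ordered list L.  Runnable processes are kept
    in a min-heap keyed by their position in L; stale entries (a process that
    moved to the rear or went elsewhere) are skipped lazily."""

    def __init__(self, T):
        self.T = T                # virtual time quantum
        self.next_pos = 0         # ever-increasing "rear of the list" counter
        self.heap = []            # (position, pid); may contain stale entries
        self.pos, self.left, self.frac = {}, {}, {}
        self.runnable = set()

    def _push(self, pid):
        heapq.heappush(self.heap, (self.pos[pid], pid))      # O(ln n)

    def admit(self, pid, f):
        """New process: join the rear of L with a fresh quantum left = f * T."""
        self.frac[pid], self.left[pid] = f, f * self.T
        self.next_pos += 1
        self.pos[pid] = self.next_pos
        self.runnable.add(pid)
        self._push(pid)

    def arrive(self, pid):
        """Process returns from elsewhere; it kept its position in L."""
        self.runnable.add(pid)
        self._push(pid)

    def go_elsewhere(self, pid):
        """Phase completed at this server; the process keeps its position."""
        self.runnable.discard(pid)

    def first_runnable(self):
        """First runnable process on L, or None if the server should idle."""
        while self.heap:
            position, pid = self.heap[0]
            if pid in self.runnable and position == self.pos[pid]:
                return pid
            heapq.heappop(self.heap)    # stale entry: discard and keep looking
        return None

    def charge(self, pid, elapsed):
        """At a decision epoch, charge the elapsed service time to pid."""
        self.left[pid] -= elapsed
        if self.left[pid] <= 0:                       # quantum exhausted:
            self.left[pid] = self.frac[pid] * self.T  # refill left_i and
            self.next_pos += 1                        # move pid to the rear of L
            self.pos[pid] = self.next_pos
            if pid in self.runnable:
                self._push(pid)

A scheduler loop would call first_runnable() at each decision epoch, run the returned process for at most left[pid] time units (or until its phase ends or an arrival forces a preemption), and then call charge() with the elapsed service time, exactly as in Figures 2 and 3.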

Lemma 1: The complexity of MTR-LS is O(ln(n)) where n is the number of active processes.



Properties of the MTR-LS Policy

The MTR-LS policy provides a fairness guarantee whose ``granularity'' depends on T, the virtual quantum. Unlike other QoS parameters, the fairness guarantee does not depend on the length of the preemption interval or whether the sum of the service fractions is less than or equal to one.

Lemma 2: The MTR-LS policy is fair with a bound of 2T. That is, for any real time interval (τ, t] during which P_i and P_j are both continuously runnable, [ |s_i(τ, t)/f_i - s_j(τ, t)/f_j| ≤ 2T. ]

Proof: The worst-case situation occurs when one of the processes, say P_i, is ahead of P_j on the list L and left_j is extremely small (left_j ≈ 0). It is possible for P_i to gain T units of virtual time, then for P_j to run for left_j/f_j units of virtual time and be placed at the back of the list. If P_i were to run for another T units of virtual time, then the accrued virtual time of P_i would be ahead of the accrued virtual time of P_j by a value which is bounded by 2T.

It is easy to see from the proof that the MTR-LS policy supports proportional sharing for processes with arbitrary, non-negative service fractions. However, the cumulative service and delay guarantees are dependent upon the service fractions and the length of the preemption interval. In the next subsection we treat the case of a zero-length preemption interval, followed by a subsection on the case Δ > 0.

The Case Δ = 0

Throughout we assume that we are using the MTR-LS policy. Assuming Δ = 0, left_j(t) denotes the value of the variable left_j at real time t.

The following is the basic lemma for the case Δ = 0.

Lemma 3: Assume Δ = 0 and Σ_{i ∈ A(t)} f_i ≤ 1 for all t. Then for all τ < t we have

[ r_j(τ, t) ≤ v_j(τ, t) + (1 - f_j)(2T - left_j(τ)/f_j) ]   (1)


Proof: The idea of the proof is to construct a scenario, E, whereby process P_j obtains service only after being blocked to the maximum extent possible by the other processes. It will follow from the construction that for any other execution in which P_j obtains the same amount of service, the amount of blocking by other processes will be no greater than in E. We show that in this extremal case the lemma holds, and therefore it holds for all other executions. The notation r_j(τ, ·) means that the second argument has some appropriate value. We use this notation to avoid having to create new symbols for these values.

The execution E is designed to ensure that P_j is blocked from running to the maximum extent possible. With this in mind, we assume that at time τ all processes P_i with i ∈ A(τ) and i ≠ j are ahead of P_j in the list L. Also assume that all of these processes run to their maximum extent, while preventing P_j from running. Thus w_j(τ, ·), the real time spent by process P_j waiting to run, without running, is bounded by (1 - f_j)T. At this point we have: [ r_j(τ, ·) ≤ (1 - f_j) T + v_j(τ, ·) ] From this point on P_j cannot be blocked further until it obtains left_j(τ) real service time. At the point where P_j obtains left_j(τ) service time we have: [ r_j(τ, ·) ≤ (1 - f_j)(T - left_j(τ)/f_j) + v_j(τ, ·) ] Now P_j is at the back of the list and so the worst case is for all the active processes to run to their maximum extent while blocking P_j from running. At the point when P_j reaches the head of the list and can no longer be blocked we have: [ r_j(τ, ·) ≤ (1 - f_j)(2T - left_j(τ)/f_j) + v_j(τ, ·) ]

Continuing as above, after every time v_j advances by T, the other processes can block P_j by at most (1 - f_j)T. Since the real running time for T units of virtual time is f_j T, the virtual time advance T equals the worst-case real-time blocking plus service time. Thus the above inequality is the worst case for the execution E.

Finally, it is easy to observe that for any other execution the amount of blocking by other processes, for the same amount of running time of P_j, is no larger than in the scenario given above. The lemma follows.

Corollary 1: The MTR-LS policy provides a cumulative service guarantee.

From Lemma 3 we can write r_j(τ, t) ≤ v_j(τ, t) + 2T for all j and all τ < t. We obtain this worst-case bound by letting left_j(τ) → 0 and f_j → 0. For the case where f_j = 1 we get r_j(τ, t) ≤ v_j(τ, t), since (1 - f_j) = 0.

Corollary 2: The MTR-LS policy provides bounded delay.

Proof: Consider a phase of process P_j that requires w units of work. The duration of the phase is w/B and the allotted virtual time is equal to w/(f_j B). Let τ be the beginning of the phase and t the time the phase ends under the MTR-LS policy. Using Equation 1 we get [ r_j(τ, t) ≤ w/(f_j B) + (1 - f_j)(2T - left_j(τ)/f_j). ] This equation states that the amount of real time taken to complete this phase is at most a constant amount of time more than 1/f_j times the duration of the phase.

The Case Δ > 0

Assume that the server has a positive preemption interval (i.e., Δ > 0). As we described above, when the preemption interval is positive we have to deal with the case in which a process arrives at the server while the server is running another process and is in the midst of a preemption interval. In this case the next decision epoch occurs at the end of the current preemption interval.

For example, let us consider a system in which there are two processes, P_1 and P_2. In this system, process P_1 runs continuously on the server and process P_2 alternates between the server and the elsewhere server. We also assume that whenever process P_2 runs on the server it runs for exactly one preemption interval Δ and then goes elsewhere for a while. Due to incredible bad luck, whenever process P_2 arrives at the server, the server is running process P_1 and is exactly at the beginning of the current preemption interval. Therefore process P_2 has to wait (this is Δ-blocking) Δ real time units before running on the server. This sequence of Δ-blocking can persist for f_2 T service time units before P_2 is put to the rear of L. In addition to the bad luck with Δ-blocking, whenever P_2 is put at the end of the list L, process P_1 blocks it for f_1 T time units. Thus we find that r_2, the real time spent at the server in pursuit of T units of virtual time service, can be as large as 2 f_2 T + f_1 T. From this it appears that 2 f_2 + f_1 ≤ 1 is sufficient to have a cumulative service guarantee for process P_2 on the server. Interchanging the roles of P_1 and P_2, we get that 2 f_1 + f_2 ≤ 1 is sufficient to have a cumulative service guarantee for process P_1 on the server.

As this example shows, in order to have a cumulative service guarantee we need additional restrictions on the service fractions associated with servers having a positive preemption interval. The following result gives a sufficient condition on the service fractions associated with the server such that the MTR-LS policy supports a cumulative service guarantee.

Lemma 4: Assume Δ > 0. Assume that for all t and all j ∈ A(t) we have f_j + Σ_{i ∈ A(t)} f_i ≤ 1. Then for all τ < t we have

[ r_j(τ, t) ≤ v_j(τ, t) + (1 - f_j) 2T + (2 f_j - 1) left_j(τ)/f_j ]

Proof: The proof of this lemma is similar to the proof of Lemma 3. We construct a scenario, E, whereby process P_j obtains service only after being blocked to the maximum extent possible by the other processes. In addition to the blocking used in the proof of Lemma 3, this construction uses Δ-blocking (see Step 2 of the MTR-LS policy) to increase the amount of blocking that process P_j incurs. It follows from the construction that for any other execution in which P_j obtains the same amount of service, the amount of blocking by other processes will be no greater than in E. We show that in this extremal case the lemma holds, and therefore it holds for all other executions.

The execution E is designed to ensure that P_j is blocked from running to the maximum extent possible. Assume that at time τ all processes P_i with i ∈ A(τ) and i ≠ j are ahead of P_j in the list L. Also assume that all of these processes run to their maximum extent, while preventing P_j from running. Thus w_j(τ, ·), the real time spent by process P_j waiting to run, without running, is bounded by (1 - f_j)T. At this point we have: [ r_j(τ, ·) ≤ (1 - f_j) T + v_j(τ, ·) ] In the proof of Lemma 3, process P_j could no longer be blocked until it obtained left_j(τ) real service time. However, due to the positive preemption interval of the server, process P_j can suffer Δ-blocking while attaining left_j(τ) service time. Since process P_j has been blocked up to this point, it has to run a minimum of Δ before it goes to the elsewhere server. When process P_j arrives from the elsewhere server we assume that some other process is in service and that P_j is Δ-blocked for at most Δ time units. Then P_j runs for Δ real time units and again goes to the elsewhere server. This pattern repeats until P_j acquires left_j(τ) units of processing time, at which time we have: [ r_j(τ, ·) ≤ (1 - f_j) T + (2 f_j - 1) left_j(τ)/f_j + v_j(τ, ·). ] Once again all the other processes are ahead of P_j on the list and can block it for up to (1 - f_j)T time units. The result is that P_j is once more at the head of the list and we have: [ r_j(τ, ·) ≤ (1 - f_j) 2T + (2 f_j - 1) left_j(τ)/f_j + v_j(τ, ·). ] From this point onward, P_j is Δ-blocked every time it gains T units of virtual time and is then blocked by all the processes until it reaches the head of the list. Every iteration results in a gain of T virtual time for process P_j, during which it acquires [ f_j T - Δ + T Σ_{i ∈ A(·)} f_i ] real time (Δ-blocking, ordinary blocking, and service time). By assumption, the above quantity is less than T, and the lemma follows.

Corollary 3: Assume Δ > 0. Assume that for all t and all j ∈ A(t) we have f_j + Σ_{i ∈ A(t)} f_i ≤ 1. Then the MTR-LS policy provides a cumulative service guarantee.

By letting left_j(τ) → 0 and f_j → 0 we get r_j(τ, t) ≤ v_j(τ, t) + 2T for all j and all τ < t. In the case Δ > 0 we also obtain a delay bound.

Corollary 4: Assume Δ > 0. Assume that for all t and all j ∈ A(t) we have f_j + Σ_{i ∈ A(t)} f_i ≤ 1. Then the MTR-LS policy provides bounded delay.



Summary

 

The major contributions of this paper are as follows: We introduced a new QoS parameter--cumulative service--for operating system scheduling to provide predictable performance for applications requiring multiple resources. The system we treat is dynamic, namely, processes enter and depart the system. Existing scheduling policies [8, 11, 5, 10] that provide delay or fairness do not guarantee cumulative service.

We have presented a new scheduling algorithm, called Move-To-Rear List Scheduling, which provides a cumulative service guarantee. Our results show that the MTR-LS policy provides a fairness bound of 2T, where T is the virtual quantum of the server, regardless of the choice of service fractions. A recent paper by Lund, Phillips, and Reingold [7] mentions an output link scheduling policy for ATM cells, called ``Ideal Round Robin,'' which is related to MTR-LS.

They do not analyze the properties of ``Ideal Round Robin'', but instead they use an approximation to ``Ideal Round Robin,'' called Fair Prioritized Round Robin, as a benchmark for comparison with a fair scheduling policy for an input-buffered switch [7].

The provision of other QoS parameters requires certain restrictions on the service fractions. These requirements provide the constraints for admission control. When arbitrary preemption is allowed (Δ = 0), we require that the sum of the service fractions be less than or equal to one. When preemption is allowed only at multiples of the preemption interval (Δ > 0), we require that for all j ∈ A(t), f_j + Σ_{i ∈ A(t)} f_i ≤ 1, where A(t) is the set of active processes at time t. This condition is easy to check since one need only check the inequality for the largest service fraction. As long as the above conditions on the service fractions hold, the MTR-LS policy meets the cumulative service guarantee within 2T and the delay bound is 2T in the worst case. The implementation complexity of MTR-LS is O(ln(n)), where n is the number of processes.
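The admission test is a one-line check per server. A sketch, using the conditions as stated above (the Δ > 0 test only needs to be evaluated for the largest service fraction):

def admissible(active_fractions, new_fraction, preemptible=True):
    """Can a process with service fraction new_fraction be admitted on a server
    whose currently active processes have the given service fractions?
    preemptible=True is the Delta = 0 case, False the Delta > 0 case."""
    fractions = list(active_fractions) + [new_fraction]
    total = sum(fractions)
    if preemptible:
        return total <= 1.0                 # sum of service fractions at most one
    return max(fractions) + total <= 1.0    # f_j + sum_i f_i <= 1 for the largest f_j

print(admissible([0.3, 0.2], 0.4))          # True:  0.9 <= 1
print(admissible([0.3, 0.2], 0.4, False))   # False: 0.4 + 0.9 > 1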



End Note

Our results can be extended to a more general process model based on partial orders.



References

[1] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw-Hill Book Company, 1990.

[2] A. Demers, S. Keshav, and S. Shenker. Design and Analysis of a Fair Queuing Algorithm. In Proceedings of ACM SIGCOMM, Austin, Texas, September 1989.

[3] D. Ferrari and D. C. Verma. A scheme for real-time channel establishment in wide-area networks. IEEE Journal on Selected Areas in Communications, 8(3):368-379, April 1990.

[4] J. Golestani. A Self-Clocked Fair Queueing Scheme for Broadband Applications. In Proceedings of IEEE INFOCOM, Toronto, June 1994, pages 636-646.

[5] P. Goyal, X. Guo, and H. M. Vin. A Hierarchical CPU Scheduler for Multimedia Operating Systems. In Proceedings of the USENIX 2nd Symposium on Operating System Design and Implementation, Seattle, Washington, October 1996.

[6] P. Goyal, H. Vin, and H. Chen. Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks. In Proceedings of ACM SIGCOMM '96, August 1996.

[7] C. Lund, S. J. Phillips, and N. Reingold. Fair Prioritized Scheduling in an Input-Buffered Switch, pages 258-369. Chapman & Hall, London, 1996.

[8] C. Mercer, S. Savage, and H. Tokuda. Processor capacity reserves: Operating system support for multimedia applications. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems, May 1994.

[9] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the single node case. IEEE/ACM Transactions on Networking, pages 344-357, June 1993.

[10] I. Stoica et al. A Proportional Share Resource Allocation Algorithm for Real-Time, Time-Shared Systems. In Proceedings of the IEEE Real-Time Systems Symposium, December 1996.

[11] C. A. Waldspurger and W. Weihl. Stride Scheduling: Deterministic Proportional-Share Resource Management. Technical Report TM-528, MIT Laboratory for Computer Science, June 1995.

[12] H. Zhang. Service disciplines for guaranteed performance service in packet-switching networks. Proceedings of the IEEE, 83(10), October 1995.

[13] L. Zhang. Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks. In Proceedings of ACM SIGCOMM, September 1990.