The Multics Scheduler

Multics Close ⊗

Bob Mullen, CISL, Cambridge, 1974. Click for a larger view.

Introduction

The Multics design distinguishes between dispatching a CPU and scheduling processes, but the implementation is all done in a single module. Dispatching switches a CPU from one process to another; this should be as efficient as possible, so we precompute a list of eligible processes to switch to. Scheduling computes the order of this list. Jerry Saltzer's 1966 PhD thesis, Traffic Control in a Multiplexed Computing System defined this distinction, and introduced the concepts of processes, block, and wakeup.

The implementation of the Multics scheduler is in the module pxss, Process Exchange Switch Stack. Its ALM source code is available online. [THVV]

Eligibility

In order to be dispatched, a process has to be in the eligible queue. Among other things, this means that the two essential pages needed for execution are present in memory. Scheduling policy mostly amounts to deciding which process to make eligible next, if indeed any at all, and how much CPU time the process can use before losing eligibility.

Processes in the eligible queue are typically in the states

running -- is now executing on some processor
ready -- could run if a CPU was available
waiting -- for some imminent event, an unlocking, a page-read

If a process calls block, for example to wait for TTY input or some other long-term event outside of the kernel's control, it is removed from the eligible queue, and not threaded in any queue. If a process's time slice ends it is threaded in some ready-but-not-eligible queue described below.

Ordinarily processes are added to the tail of the eligible queue. They tend to float toward the top as other processes lose eligibility. An IO bound process (use little CPU, do many waits...) tends to wind up at the top of the queue, a desirable situation: One wants the processes that consume CPU time to be at the bottom of the queue, to soak up CPU, so the CPU does not idle during IO bursts.

Foreground-Background Scheduler

The first Multics scheduler, circa 1967, was descended from the one in use on CTSS. It was sometimes described as a "Greenberger-Corbató exponential scheduler." The scheduler had N queues, something like 6. When the scheduler wanted to choose a process to run, it took the one at the head of the highest priority empty queue. A process that started requesting service would be put at the tail of the highest priority queue. Processes from that queue were given a small time slice, called the QUANTUM. If the timer went off and the process was still running, it was sent to the tail of the next lower queue, and when processes from that queue were run, they got a time slice twice as long (thus, exponentially longer time slices). The lowest priority queue was called the "background" queue. Thus, long-running jobs would not run until higher priority jobs were done; the scheduler would then run these background jobs for long blocks of time. A job that blocked on disk I/O was skipped over but didn't lose its position in queue; a job that blocked on terminal I/O started over at the top queue when it received an input wakeup. There are more details, but that's a general idea. [THVV]

Percentage Scheduler

In 1975, Bob Mullen implemented the Multics "Workclass" scheduler, originally done to provide different percentages of the machine to different groups of users. It was not that each user got a few percent, but that a group was to get some total percent. (Scheduling of time for users within the group was done by what is called an FB-n algorithm, meaning n levels of foreground and background priorities.)

This scheduler was overlaid on the original FB-n scheduler, in the sense that scheduling within a group was done by the FB-n scheduler. The workclass scheduler decided to make a process "eligible" based on its workclass being the most worthy; the ready queue for each workclass was sorted as an FB-n queue.

The administrative interface for the workclass scheduler was written by Tom Casey and the Scheduler/Dispatcher algorithms by Bob Mullen. The percentage allocations could be varied by shift. There could be 16 workclasses. The initial proposal for this set of changes was in MTB-193 Priority Scheduler (1975-05-19).

Objective

Allow a site administrator to assign each user to one of 16 groups, and to assign a percentage of the machine to each group. The percentages add up to 100 percent. Each Multics project is assigned to one of the groups, and so all user processes derive their work class assignment at login.

The minimal expectation is that if all groups are presenting an offered load which exceeds their allotment, the amount received will match what was specified. Other implicit expectations are assumed and drive parts of the implementation described below:

Assume that when some groups are not presenting sufficient demand, other groups are allowed to use the time. In other words we never idle to constrain a group.
The time unused is divided among groups able to use it, roughly in proportion to their original allotments.
A key question has to do with the time scale at which the control takes place: how often, and more importantly the integration period on which the control decision is based. An extreme example, in a time sharing system at 3pm, a group should not be affected by how much time it used in the morning. To refine that, consider that time sharing users expect a response time of about a second and use about a second --- and if not run within a second will perhaps understand the decision if it is based on time used by them or their group in the present second or the last few. They will not accept that they cannot run now because during an idle period 20 seconds ago their group got lucky and chewed up their share of the next several minutes.

(If your group is allotted 40% of the machine and it has used 40% of the last second and is likely to use 40% of the next second, and yet you cannot run --- your gripe is with the admin/budget person that sets your group's percent, not with the algorithm.)
Even if the scheduler does not make hard commitments of future time allotments, one should consider the decision to begin running a job from a certain group as increasing the spend rate of that group in the near future. In the case of a time sharing user being given a time slice, one does not know for sure whether the user will use the whole time slice or not. However there is the possibility the user will. The concrete application of this consideration can be stated simply as: If the machine is idle at times S & T which are arbitrarily close together (= if bandwidth becomes available at S & T etc.), and if there are users from several groups available to run and if one of the groups is most deserving of time at time S and gets it, then it is NOT true that at time T a user from the same group should be run. That is: one could easily imagine an algorithm which, based on the near identity of the system states at times S & T, made the same decision at both times --- it would be wrong. The fix is to immediately decrement the credits of the group whose member was selected at time S by some amount, for example the time slice awarded. Then if there is an immediately following allocation of run time, a member of the same group will only be selected if that group remained most worthy, after the subtraction.

Obviously at some point the arbitrary subtraction should be added back in, and the amount of time actually used should be subtracted instead.
One must be careful not to ruin the response time of the groups with small percentages. They are already at a disadvantage due to elementary queuing theory results if the percentage scheduler works close to spec. A group assigned 10% that offers a load of 15% will perceive poorer response than a group assigned 60% that offers a load of 65%. This is understandable without queueing theory. A group assigned 10% that offers a (random) load of 9% will see worse response than a group assigned 60% that offers a load of 54%; this follows from queuing theory, assuming service times are the same for users in the two groups. What must be avoided are "lumps" in the algorithm that make things worse than they need to be. (Note the 9% user could buy 11% and greatly reduce his response time under heavy load.)

Idealized Algorithm

Suppose A is the time slice that can be awarded, or what should be roughly the same, the responsiveness desired from the system.

As the time is used by anyone, add time credits to all groups in proportion to each groups assigned percentage. Thus in general, for every second of time used by anyone, a total of one second worth of credits are distributed. (SCATTER CREDITS)

Decide which job to run on the basis of which group has the most credits at the moment.

As jobs run, subtract time credits from their group. This and SCATTER CREDITS essentially conserve the number of credits in the system.

SCATTER CREDITS, the run decision, and the subtraction of time used for the using group address the primary goal, percentages specified by the admin.

When a job is started running, subtract some fixed amount A from its group's credits. When the job is done, add A back into its group's credits. These two operations, on average, conserve the number of credits in the system. Together they address point 4 above.

A group's credits can never exceed 4*A, nor go below 0. Thus SCATTER CREDITS says group.credits = min (group.credits + delta, 4*A). And as a job runs, group.credits = max (group.credits - used, 0). These two operations destroy and create credits, and limit the integration time of the system. If for some extended period only one group demands (and gets) service, the other groups will not build up more than 4*A credits, and the active group's credits will not go below zero. Thus if other groups suddenly became active the first group will be back in the ball game fairly quickly, more quickly if it is a 60% group, less quickly if it is a 10% group. This addresses point 3 above, a short integration time (4*A); also point 2, if some group is not using its credits, additional credits are passed out to demanding groups on the same ratio as originally specified.

One could argue that the max # of credits (the cap) should be proportional to the groups' percentages, rather than the same across all groups. There are two problems with that. First it would mean the integration time, the memory of the system, could be longer for some groups than others, making behavior less predictable. Second, it would work to the advantage of the larger groups to allow them a larger buildup of credits. It will be (relatively) better for the smaller groups to not do that --- this addresses point 5 above. Note this is only an issue if there is some measurable idle time. If there is no idle time, the lower % group will have no advantage.

Compromise for Responsiveness

It is a result of queueing theory that if you place 25% of the users in 4 queues each assigned 25% of the cpu time, the response time for all users will be 4 times worse than if they shared a single queue running at 100% (under comparable heavy loads). This implies that schedulers dividing up cpu-time --- or bandwidth of any kind --- cannot avoid drastically degrading response time. Because this phenomenon was visible during testing, an additional queue was created for ready non-eligible process that had just interacted. This "interactive-queue" was examined before the workclass queues and allowed all users to receive their initial time-slices in first-come-first-served order regardless of recent usage by their workclass or the percentage assigned to it. Of course the credits for their workclass were decremented in the same manner as above. Because the first time slice was short, it was rare for a workclass to exceed its percentage due to this loophole. On the one hand this maintained good response time to simple commands; on the other, response to longer commands is not subject to stringent user expectations. Without the "interactive-queue" the rest of the percentage scheduler would have been doomed by elementary queueing theory. Even today it is worth asking of any new scheduling algorithm, "What does it do to average response time?"

Deadline Scheduler

In 1975-1976, Bob Mullen added features to the scheduler to

provide more efficient support for daemons driving physical devices such as printers.
allow precise tuning of workloads for competitive benchmarks.

The version of the scheduler is called the "Deadline Scheduler."

The deadline scheduler used the workclass structures to implement a wholly different scheduler in which neither the FB-n nor the percentage parameters were used. Considering that most people do not understand the FB-n algorithm, the top level view was that the scheduler could be operated in either "percent" mode or "deadline" mode.

Using some workclasses with virtual deadlines along with a few with realtime deadlines was a convenient way to tune benchmarks with response time requirements which varied for different user scripts. However realtime workclasses can also be used in conjunction with percentage groups, and such a use was common at customer sites.

Virtual Deadlines

Here each workclass was given two pairs of numbers; R1 and R2 could be though of as response times, and Q1 and Q2 were time quanta. When a process received an interactive wakeup the scheduler read the clock and set

  process.deadline. = clock() + workclass.R1

Then the process was sorted by deadline into a queue of ready (but not eligible) processes from all workclasses. The process at the head of this deadline queue would be the next process to be made eligible. There was no attempt to meet the deadline, nor to delay running until the deadline; the deadlines were just a way to sort the work to be done. Thus the deadlines were virtual, meaning not real.

When the process became eligible it was given a quantum of Q1, and placed at the end of the eligible queue. When its time slice expired it was resorted into the ready queue, after another clock reading:

  process.deadline. = clock() + workclass.R2

When it was made eligible again, it would receive a quantum of Q2. The allocation of R2 responses and Q2 quanta continued until the next interactive wakeup.

Realtime Deadlines

Realtime deadlines were not deadlines for completion of work, but for the start of work. If a workclass was identified as realtime its processes were sorted into a special "realtime" ready queue by deadline. Until the deadline arrived no new eligibility would be granted to a realtime process, unless there was nothing else to do. However when the deadline arrived, a realtime process was moved to the head of the eligible queue (ignoring any restriction on max_eligible, working set size, etc.). Subsequent deadlines and time slices were computed as for virtual deadlines, the sorting and dispatching used the realtime ready queue.

Remarks on Realtime

In general it would be a mistake to have too many realtime processes. The assumption was that the administrator would not have too many and the amount of processor time committed to them (asymptotically at most Q2/(R2+Q2) fraction of a processor, less due to IO waits...) would not be large. For example to keep a double buffered line printer busy that required 50 ms of CPU time to fill a buffer that would generate 2 seconds of printing, one would set:

  R1 = 1.9 sec        R2 = 5.0 sec
  Q1 =  .06 sec       Q2 =  .05 sec

One needs to know that the refresh-buffer wakeup from the printer driver to the printer process gave interactive credit to the process. Thus under ordinary operation the printer process would start running within 1.9 sec and complete running before using .06 sec. For about 3% of a processor the printer would never halt if there was work to do. If the printer went into a loop, or someone tried to use the printer process to do a compile, the process would receive additional time slices very slowly, for a rate of 1% of a processor. Similar applications included RJE lines and backup processes.

The present author would use a realtime workclass for himself in the early stages of tuning a benchmark, in order to never be locked out of the system due to poor response time.

The View from Multics Marketing

Harry Quackenboss gave a presentation on Priority Scheduling at the Multics Users' Group at HLSUA XXV, on 18 Oct 1977. He writes:

[HVQ] Of all the things that I did with Multics, the Workclass Scheduler definition with Bob Mullen was one of the most enjoyable, and I think significant contributions to Multics sales.

The Multics system purchased by the Universities of Bristol and Bath in the U.K. might have happened anyway, but the ability to allocate guaranteed percentages of the system CPU and other resources was pretty important. This was a highly unusual situation in that Bath, with a much smaller computer acquisition budget, wasn't able to purchase a Multics system on their own. I don't know whether it was the Honeywell sales team, led by Neil Aldred who fostered the idea, or whether the two universities decided on their own to pool their budgets, but being able to pool their funds to buy one system, housed at Bristol, with remote access by Bath, and guarantee that Bath was going to get their fair share was essential to the sale, and also a pretty big leap of faith by Bath. As often happened with University users, the more the faculty and graduate students learned about Multics, the more they liked what they saw.

This also represented the first U.K. Multics installation, and was crucial to Honeywell Information Systems Ltd. to making the investment in sales, technical support, and heavy pre-sales expenses to respond to RFPs and run complex benchmarks.

In the 1970s, various big benchmarks often required tests such as "run 80 interactive users with a mean response time of 2 seconds maximum, and complete 80 batch jobs in an hour." The work class scheduler was a much more straightforward tool for matching the system behavior to particular workloads than what Multics had previously, and what competitors had to do to meet the benchmarks.

The scheduler also became important to other university sales where Multics was offered as both an academic computing platform, supporting widely varying demands based on what professors and researchers wanted to do, but also allowing the administrative computing users to be assured that the academic side would not hog the machine.

But there is another chapter to the story. Much of the impetus to implement this capability to guarantee resources to particular user groups came as a request from General Motors, the first commercial customer. When I decided to join Honeywell as part of the support team on the GM Multics installation, I had interviewed with GM to work on the same system. The argument that swayed me was made by the Honeywell systems manager: If you want to be on the leading edge, go to work for the vendor. I decided years later this wasn't always true, and there were several examples where Multics customers led Honeywell in its use. But at the time it was a powerful argument.

At the time I joined the Honeywell team, the internal group at General Motors responsible for the Multics system justified the purchase in large measure as a response to the rapid growth of GM's computing costs going to outside time sharing services. The biggest of those users was a group in General Motors' marketing research and analysis section, that reported to my father.

When I interviewed with Honeywell, one of the interviewers was concerned about the politics and my loyalties, but it never once caused a problem. However, my dad went to the time sharing service he was using and told them unless they came up with a new pricing scheme, he was going to be cutting back their business. He and his group had pretty strong game theory backgrounds, and proposed to the supplier that they could commit to a certain volume of business, as long as they could get better pricing, and could use otherwise idle CPU while still paying a flat rate. Their supplier came up with a much simpler scheme than the workclass scheduler, but it nonetheless worked pretty well for them. By delaying their lunch time, or coming in early, the marketing staff users could actually get better response time, while lowering their total bill.

With this approach the outside service made the case to other GM departments that they were actually the more economical choice. In response, the GM Multics team asked Honeywell for a way to offer a similar capability.

It took about forty years, but in the modern era of virtual machines, you can see similar but harder to use, less elegant mechanisms being employed to achieve similar goals.

Written: 08/21/95
modified: 03/04/00 with info from Bob
modified: 01/26/16 with info from Harry