The Multics GE-645 DoS hardware bug that was never fixed
Have you ever heard about the "foobar" bug? How about the "garbage" bug? Actually, they're the same bug, and you probably never heard of it because it would take down the 645 Multics with one instruction.
It was a weekend in 1972, probably a Saturday and probably in late Spring or early Summer. The Multics group at 545 Tech Square had a good turnout of staff and grad students. I was sitting in a common terminal room on the fifth floor debugging yet another DIM (Device Interface Module). As I typed a test command, the clatter of the (IBM 2741) terminals ceased -- Multics had crashed. Nothing new there: all the computer systems of that era seemed to crash on a daily basis. Those of us in the terminal room dialed the Multics number on our modems and chatted as we waited for the modems to shriek, indicating that Multics had come back up. When it finally did, I went back to testing and debugging. I glanced up at the last commands on the terminal paper and retyped them. After I typed the final command, the clatter again ceased. Multics had crashed again -- at exactly the same point in my typing. That was too much of a coincidence!
I ran up to the ninth-floor machine room and talked with the operator. There wasn't much information available, and someone would have to look at the dumps. Fortuitously, the CompCenter staff was often in on Saturday morning. That day it was possibly Roger Roach. Whoever it was, he started the analysis based on my hunch that I had been responsible.
Meanwhile, I went back and looked over my session history. Finding an error in my test procedures, I corrected it and got on with my testing. At some point, I got a message: sure enough, my process was the active process on both crashes, but the crash data (particularly the SCU data) was strange and inconsistent. So, what had I been doing? I explained that I had typed garbage in place of the proper parameter. Eventually, we decided that I should try to reproduce the problem, and we arranged a future special session. At that time I logged in, typed my commands, and Multics went down like a rock.
So, what had I been doing?
First we need some context.
Almost every time-sharing OS has a command processor (aka command line interpreter). Putting aside fancy frills, such as star conventions or brace expansion, its basic purpose is to accept a series of characters and submit them to a "command" program for processing. On Multics, for example, the command processor accepts a line of characters and splits that line into multiple tokens based on whitespace separators. It then interprets the first token as a command name, creates an argument list out of the rest (if any), and invokes the command program. For example, given the line:
echo arg1 argument2
the Multics command processor creates an argument list declaring a first argument (arg1) of type character string with length 4, and a second argument (argument2) of type character string with length 9. It then locates the command echo (actually entry point echo$echo), fabricates a subroutine call to echo$echo using the two-element argument list, and executes a subroutine call sequence.
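For readers who think better in code, here is a rough modern sketch of that splitting step in Python. It is only an illustration of the idea, not how the Multics command processor was written; the function name build_argument_list and the dictionary layout are made up for this example.

```python
def build_argument_list(line):
    """Split a command line on whitespace into an entry point name and
    a list of character-string argument descriptors.

    Hypothetical illustration only; not the real Multics implementation.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty command line")
    command, rest = tokens[0], tokens[1:]
    # Convention described in the text: command "echo" resolves to echo$echo.
    entry_point = f"{command}${command}"
    # Every argument is passed as a character string with a known length.
    args = [{"type": "char", "length": len(tok), "value": tok} for tok in rest]
    return entry_point, args


entry_point, args = build_argument_list("echo arg1 argument2")
print(entry_point)
for arg in args:
    print(arg)
```

The point to notice is that every argument descriptor says "character string", whatever the user typed. That is exactly the limitation that matters in the rest of this story.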
On Multics, of course, a command is just a subroutine whose parameters are all character strings (though, for simplicity, most real commands use the subroutine cu_$arg_ptr to get at their arguments). Still, the command processor can invoke any subroutine in any segment -- but it makes no sense to invoke a subroutine that isn't expecting character arguments.
To avoid having to create lots and lots of special test driver programs, though, it would sometimes be useful to be able to invoke those other subroutines directly -- and provide them with the types of arguments that they are expecting. Dave Reed had recently created just such a command processor. It is named call, and it is implemented as a normal Multics command that reinterprets its arguments, much as the ioa_ printing subroutine or C++ I/O stream chains do.
I've forgotten the exact syntax, but it was something like this. Given the example Multics command line:
call hcs_$initiate -s ">udd>m>dmw" -s "xxx" -s "" -n 0 -n 0 -o -p 0|0 -o -n 0
the call command would invoke the hcs_$initiate subroutine in much the same way as if we had used the PL/I statements:
declare hcs_$initiate entry (char(*), char(*), char(*), fixed bin(1), fixed bin(2), ptr, fixed bin(35));
call hcs_$initiate (">udd>m>dmw", "xxx", "", 0, 0, sptr, ecode);
Upon return, call would print the segment-pointer and error-code variables. The '-s' indicates that the next argument is a string; the '-n' indicates that the next argument is a fixed bin (35) number; and the '-o' indicates that the next argument is to be printed upon completion.
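To make the flag scheme concrete, here is a small Python sketch that parses an argument string of that shape into typed descriptors. Since I no longer remember the real syntax, the grammar here is an assumption built from the example line above: the '-p' flag for a pointer is inferred from that example, and the function name parse_call_args and the dictionary fields are invented for illustration.

```python
import shlex


def parse_call_args(tokens):
    """Turn flag/value pairs into typed argument descriptors.

    Assumed grammar (reconstructed from the example, not the real command):
      [-o] -s STRING | [-o] -n NUMBER | [-o] -p POINTER
    A leading -o marks the argument to be printed after the call returns.
    """
    converters = {
        "-s": ("char(*)", str),        # character string
        "-n": ("fixed bin(35)", int),  # number
        "-p": ("ptr", str),            # pointer, kept as text such as 0|0
    }
    args = []
    i = 0
    while i < len(tokens):
        print_on_return = False
        if tokens[i] == "-o":
            print_on_return = True
            i += 1
        if i + 1 >= len(tokens):
            raise ValueError("flag without a value at end of line")
        flag, raw = tokens[i], tokens[i + 1]
        if flag not in converters:
            raise ValueError(f"unknown type flag: {flag}")
        type_name, convert = converters[flag]
        args.append({"type": type_name, "value": convert(raw),
                     "print_on_return": print_on_return})
        i += 2
    return args


line = '-s ">udd>m>dmw" -s "xxx" -s "" -n 0 -n 0 -o -p 0|0 -o -n 0'
for arg in parse_call_args(shlex.split(line)):
    print(arg)
```

Note that nothing in such a scheme checks the typed arguments against the declaration of the subroutine being called. The caller's flags are simply trusted.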
So, what had I been doing?
I had been testing my DIM with command lines similar to the following:
call ios_$attach -s "test" -s "DIM" -s "DEVICE" -s "MODE" -o -b72 0
call ios_$write -s "test" -s "foobar" -n 6 -o -n 0 -o -b72 0
where the relevant declarations are:
/* call ios_$attach (switch, dim, device, mode, status); */
dcl ios_$attach entry (char (*), char (*), char (*), char (*), bit (72) aligned);
/* call ios_$write (switch, bufptr, offset, nelem, nelemt, status); */
dcl ios_$write entry (char(*), ptr, fixed bin, fixed bin, fixed bin, bit(72) aligned);
This was intended to attach my DIM to the DEVICE pseudo-device on the stream "test" and then write the string "foobar" (*) to the DEVICE.
(*) In MIT culture (see https://en.wikipedia.org/wiki/Foobar), the first nonce variable is generally named "foo", the next one is "bar", and the third is "foobar" or "baz". I usually started directly with "foobar".
I noted above that in my discussion with the CompCenter staff, I mentioned that I had typed "garbage" instead of the proper variable. As it happened, during the special session, I had typed:
call ios_$attach -s "test" -s "DIM" -s "DEVICE" -s "MODE" -o -b72 0
call ios_$write -s "test" -s "garbage" -n 7 -o -n 0 -o -b72 0
And Multics crashed.
So what was going on?
The problem was that I had made an error in formulating the argument list to call. The ios_$write subroutine actually takes a pointer second argument, but I had provided a string ("foobar" or "garbage"). This caused the hardware to interpret the string as a pointer. You will notice that both "foobar" and "garbage" have a "b" in the fourth character position. When the hardware is told to interpret either string as a hardware pointer (as my error in invoking call had done), the lower portion of the "b" is interpreted as the address modifier (TAG) field. The "b" (octal code 142) installed the octal code 42 into the TAG field, which in normal Multics operation contains an octal 43 (ITS pointer) or an octal 46 (fault-tag-2, used for dynamic linking). The octal 42, which is an undefined value, should generate a fault, which would be signaled to my program as an illegal operation -- but it didn't.
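The bit arithmetic is easy to check with a few lines of Python. This sketch assumes the layout implied above: ASCII characters stored in 9-bit fields, four to a 36-bit word, left to right, with the TAG field in the low-order 6 bits of the word. The function names first_word and tag_field are just for this illustration.

```python
def first_word(text):
    """Pack the first four characters of text into one 36-bit word,
    9 bits per character, leftmost character in the high-order bits.
    (Assumed layout, as described in the lead-in.)"""
    word = 0
    for ch in text[:4].ljust(4):
        word = (word << 9) | ord(ch)
    return word


def tag_field(word):
    """Return the low-order 6 bits of the word, where the TAG field sits."""
    return word & 0o77


for s in ("foobar", "garbage"):
    w = first_word(s)
    # Print the word as 12 octal digits and the TAG as 2 octal digits.
    print(f"{s}: word={w:012o} tag={tag_field(w):02o}")
# foobar: word=146157157142 tag=42
# garbage: word=147141162142 tag=42
```

Each 9-bit character is exactly three octal digits, so the "b" shows up as the trailing 142 in both words, and masking off its top three bits leaves 42.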
Based on the results of the special session and a bit of poring over the dumps and the hardware machine manual, it was clear that the underlying problem was in the hardware. So we turned the problem over to the Field Engineers. They scheduled an investigation during the next preventive maintenance period. After a couple of days they reported back their finding: I had encountered a flaw in the processor. As it was explained to me, when attempting to determine the cause of the fault, some parts of the processor thought it should do one thing and other parts thought it should do something else. Based on this, the processor was unable to create the proper SCU data. I don't think I was told whether the processor simply reset itself or generated a trouble fault (or something else).
Technical note. In retrospect, this makes sense. The TAG field was part of the original 635 design. The 645 architecture was added later, and address space for the various extension capabilities had to be clawed out. The ITS and ITB (indirect to base) actions have nothing to do with the 'indirect then tally' modifier from whose address space they were taken. Thus, the logic implementing both the ITS and ITB modifiers had to both perform the 645 function and suppress the 635 fault. That the designers would make such a design error in an error path was not surprising, and the fact that it hadn't been discovered earlier was also not surprising as this error path would not have been widely used. (Almost all pointers in Multics were created by the hardware, compilers, or well-exercised system utilities -- and thus were well-formed.)
The really odd thing to me was that my choice of nonce variables (foobar and garbage) was just happenstance, and the odds that they both had "b" as the fourth character and the "b" just happened to cause this fault were simply astounding.
In any case, the problem was definitely a processor design problem. The fix would require the services of the hardware engineers in Phoenix. But, they were all busy working on the new processor (the 655 or 6090 as it was then known). So, was it better for the hardware engineers to work on fixing the old processor and delay the new processor, or continue on the new processor and take the chance that there wouldn't be more problems with the old processor?
At this point a decision was made (at a level above my pay grade). It was noted that: 1) no one else had encountered this problem in the several years of use of the 645, and 2) we would be retiring the 645 processor in less than a year. Based on this it was decided that this bug would not be fixed -- and it never was -- and, as far as I know, no one else ever encountered it.
Posted 22 Feb 2016; updated 25 Feb 2016 with input from Chris Tavares, Gary Dixon, and Dave Jordan.