Tapes for Phoenix: the GIOC bug that was never fixed
Progress was in the air in 1972. The Multics system was a proven success as a research project; now it was time for Honeywell to turn it into a commercial product. Although the project management had been shared by MIT and Honeywell (and earlier Bell Labs) from the beginning, many of the core development functions were still handled by MIT staff. As part of this evolution of Multics, Honeywell had produced a new hardware generation for the Multics system, the 6180. It was generally recognized that the switch to the 6180 was a good time to pass the baton of leadership to Honeywell. I was a staff member of CSR (the Computer Systems Research division) of Project MAC, an employee of MIT. My normal job was working with computer networks, mostly the connection of ARPANet to Multics, but I had been volunteered to assist with the 6180 upgrade on an as-needed basis.
Converting from the 645 Multics to the 6180 version was a long and convoluted process. The 6180 built on the 645's basic architecture, but much was changing:
- the ring architecture was different (and now mostly in hardware);
- the 645 had 4 pointer registers: the 6180 would have 8;
- the 645 used the GIOC for I/O: the 6180 would use the standard Honeywell IOM and DataNet communication processor;
- the 6180 would use a newer generation of disks;
- the 645 paged to a drum: the 6180 would use a bulk store;
- the reconfiguration hardware had been reengineered based on a painful history of botched dynamic reconfigurations.
Steve Webber was the software lead for the hardcore changes; John Gintell was the project manager. Both worked at Honeywell CISL. The porting team comprised most of the Honeywell Multics team and a few members of the MIT Multics team. Some of the changes were reasonably straightforward. The CISL compiler group modified the PL/I compiler for the new processor. Bob Mabee (Project MAC) modified the assembler. It was possible to plug an IOM into the 645, and many of the device drivers had already been debugged in Cambridge. The same was true for the bulk store. The 4 base registers on the 645 were "unlocked" and made into 8.
Still, there would likely be problems with bringing up the actual new CPU. Once the diagnostics for the new machine ran properly, the software team would have to begin the OS bring-up for the new machine. The Multics boot-load mechanism used a few Hollerith cards which eventually read the 6180 version of the hardcore operating system into memory from an MST. Changes to that operating system were performed by modifying the source code, building a new system, writing a new MST, and then boot-loading the processor from that tape.
The Multics software team was in Cambridge MA, and the MIT Multics machine was in the 545 Technology Square building; the hardware was being designed and built by Honeywell in Phoenix AZ. Since it was likely that there would still be bugs to be fixed in the hardware (and in the software), members of the Multics software team would have to travel to Phoenix for system bring-up. The complication was that the tool-chain for building boot-load MSTs ran only on Multics, and Multics ran only on the 645 processor. Worse, 645 processors were extremely scarce: all the 645 processors in the world were in daily service for real, paying customers -- and Phoenix didn't have one. So, how was the team going to bring up the new machine?
The plan that emerged was that various of the hardcore software engineers would rotate through Phoenix. They would fly out in turn to help debug the parts of the kernel with which they were most familiar. These engineers could dial up the Multics system in Cambridge, make their fixes, and generate an MST boot tape. But a system in Phoenix couldn't boot from a tape in Cambridge, so how could we get the tapes to Phoenix in a timely fashion?
The first answer was that the tapes could be transported to Phoenix via airline courier service. Once a tape had been written in Cambridge, it could be driven to the Boston airport, where it would be placed on a commercial flight to the Phoenix airport, and from there it would be delivered to the Honeywell Phoenix facility. Door to door, it might be done in less than seven hours -- but only if the heavens lined up with the airline's flight schedule.
The second answer was me.
One day in the early summer of 1972, as I remember, I was handed a couple of manuals and informed that I was to develop a GRTS remote workstation emulator for Multics. I would then use that emulator to transmit the data on an MST to a GCOS system in Phoenix, where it would be written to tape. Moreover, because the MST was not a "standard" tape (i.e., in neither IBM nor ANSI tape format), it didn't seem to be possible to spool the data on the GCOS server. Thus, I would have to submit commands to copy the entire MST tape, which comprises multiple tape "files."
GRTS had a reasonable remote job entry (RJE) workstation interface. It was similar to many others, such as IBM's 2780. A user created job cards, which logged into the GCOS host and then copied data from one peripheral to another peripheral, usually a remote GRTS card reader to GCOS data file and then GCOS data file to remote GRTS printer. It communicated via the bisync protocol using a synchronous modem. It was common to use a Bell 201A modem, which transmitted data at the then astounding speed of 2000 bps. Assuming normal loading, this would allow a data transfer rate of about 225 cps -- about 800K bytes in an *hour*. This still wasn't quick turnaround, but it would be faster than the airplane. Also, during the first stages of the bring-up, we might be able to omit the (as yet) unused portions of the MST.
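The figures in the paragraph above can be checked with a little arithmetic. The sketch below takes the 2000 bps line speed and the 225 cps effective rate from the text; the 8 bits per character is an assumption made here only to estimate the raw ceiling of the line.

```python
# Back-of-the-envelope check of the transfer-rate figures quoted above.
LINE_RATE_BPS = 2000   # Bell 201A line speed, from the text
EFFECTIVE_CPS = 225    # effective rate under "normal loading", from the text
BITS_PER_CHAR = 8      # assumption: 8-bit characters on the synchronous line

raw_cps = LINE_RATE_BPS // BITS_PER_CHAR   # ceiling with no protocol overhead
efficiency = EFFECTIVE_CPS / raw_cps       # fraction of the ceiling actually used
bytes_per_hour = EFFECTIVE_CPS * 3600      # characters moved in one hour

print(raw_cps, efficiency, bytes_per_hour)
```

Under that assumption the line tops out at 250 cps, so 225 cps corresponds to about 90% utilization, and an hour of transmission moves 810,000 characters, which agrees with the "about 800K bytes" above.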
I was still relatively new to the Multics project, but I wasn't a bad choice for this programming task. Before moving to the Multics project, I had hooked up an IMLAC PDS-1 graphics workstation to Multics via a home-brew communications protocol. Also, I was probably the only person still on the Multics staff who had created and used a Multics program using a synchronous communications protocol.
I should expand a bit on that. Since moving to the Multics project I had discovered the PDP-8 that Jerry Grochow had used for his S.M. thesis, "The graphic display as an aid in the monitoring of a time-shared computer system," MAC-TR-54 (described in "The Instrumentation of Multics"). Jerry had investigated performance issues of Multics using the PDP-8. To do that, the PDP-8 was connected to the Multics GIOC via a synchronous communication line -- which was still connected and operational. Furthermore, the PDP-8 had D-A (digital-to-analog) converters hooked up to stereo speakers and a powerful DEC 338 graphics display. Bob Mabee and I had taken to playing with the PDP-8 in our off hours, producing music and interesting display patterns: Bob worked mostly at the PDP-8 end; I worked mostly with the Multics end. More specifically, I had written DIMs and application programs to talk to the PDP-8.
It would be a while before I could actually test my program against the real GCOS server (because it often took several weeks to get a modem line installed in those days). Nevertheless, I set to work. First, I elected to pretend to be a Honeywell G-115 remote workstation. Then I discovered that the manuals didn't have enough detailed information on the communication protocol. (I remember ordering a manual on bisync from IBM). I also did some research on how other RJE systems worked. Finally, I started coding. I created an interface whereby the emulator accepted a file that mimicked the normal GRTS G-115 job deck (nothing special there) and augmented the command language to include the ability to read and/or write from/to I/O streams, such as the tape DIM. I even wrote a simple dummy server so that I could test some of the code.
Eventually the phone line and modem were ready. (I seem to remember that Dave Vinograd was responsible for the logistics.) And then I started real debugging. The first issue was that the phone hookup was flaky. There was no ACU (automatic call unit) and I had to manually dial each call. There were two problems with that. First, the modem had to be close to the MIT Multics 645 computer. There were no Multics terminals close to the modem, so I had to start my program and then walk some distance to dial the modem. By the time I got back to the terminal, the entire process would usually have completed, and I was left to stare at the terminal output and then to pore through the trace output(*) from various probes that I had included. Second, only about a quarter of the calls would actually go through. Most of the time, the line would click and buzz a bit but never actually connect to a modem at the other end, and I would have to redial until I got an actual connection.
(*) When debugging communication systems, it's often necessary to use tracing (e.g., PRINT statements) because such systems usually have timeouts enforced by parts of the system not under control of your debugger, in this case, the GCOS server bisync controller. Thus, any pause to use a debugger results in a timeout -- and an end to the session.
Still, I was actually able to start testing my code. The first problem was that the GCOS server didn't understand any of the data frames that I was sending. That was solved after a quick examination of the data traces: I had followed the IBM bisync spec, which used EBCDIC; GRTS, however, used ASCII control characters. With that solved, I started to make good progress. Soon, I was able to submit a FORTRAN program to the GCOS system, run it, and retrieve the listing and output. Now it was on to debugging the code that would write a tape at the GCOS end. There was a bit of complication, but that was also soon solved.
Finally, I was ready for the real test. I mounted an MST on a local Multics tape drive and submitted a job deck that would:
- ask the GCOS operator to mount a tape,
- copy the data to Phoenix, and
- write an image of the MST tape on the tape in Phoenix.
After finding a few bugs in my code, I was able to mount the tape and copy several records -- and then the Phoenix GCOS system would hang up on me. This happened repeatedly as I added successively more and more tracing info to my emulator. Eventually, I discovered that the hang-up was preceded by a series of requests for a repeated transmission of a particular record. Apparently, GCOS thought that my transmission was being corrupted by line noise, and it would terminate the session when it thought that the line was too noisy to use.
This was a real problem. How could I diagnose this? I scoured my various manuals. Had I computed the checksum wrong? Had I included the wrong characters in the checksum? Eventually, I convinced myself that the frame format and my checksum algorithm were correct. So, was there something wrong with the data going out? Was the line really too noisy? If so, why did all the previous data frames go through correctly?
There were no good tools to help, either. There were as yet no such things as DataScopes. There didn't seem to be anyone at the GCOS end who could look at the data that I was sending. The data frame was much too long to be able to put it up on an oscilloscope.
Eventually, I remembered the PDP-8. Now that I had a trace (from the software side) of a frame that was causing the problem, maybe I could look at the actual transmitted data by sending it to the PDP-8. At least that might help me partition the problem one side or the other of the modem.
The first thing that I had to do was to create a test jig to send bisync frames to the PDP-8 and look at them there. After I got that working, I could switch the communication cables to be able to look at the output of the line card that I was using to talk to the GCOS system. So, with Bob Mabee's help, I sent bisync frames over the line to the PDP-8. Most frames were transmitted correctly, but when I tried the problematic frame, I could see that the data received by the PDP-8 had been corrupted. Pay-dirt! Apparently both synchronous line cards on the GIOC had a problem sending certain data -- perhaps it was a hardware bug. All right, this was a problem I could work with. And, so I did. Bob and I set up an echo mechanism between Multics and the PDP-8. What I sent from Multics would be returned for examination. After much investigation and jiggering of test data, I finally found a pattern:
For certain characters, if the lower order bits of one character were the complement of the lower order bits of the following character, one of those lower bits in the second character would be forced on erroneously. (This only occurred when transmitting from Multics.)
With that discovery, we started to make progress again. First, since we had not detected this problem previously with the PDP-8, we recognized that we had been encountering undiscovered errors in our communications between Multics and the PDP-8. We fixed that by inventing an escape protocol, whereby the Multics transmission code would check for the special complement pattern and insert an escape sequence. The PDP-8 would then undo that escape sequence. We then exchanged millions of characters between Multics and the PDP-8 to prove that we had correctly identified and isolated the problem.
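The general shape of such an escape protocol can be sketched as follows. This is only an illustration, not a reconstruction of the code Bob and I actually wrote: the assumption that "lower order bits" means the low four bits, the escape byte values, and the one-bit flip applied to an escaped character are all hypothetical, and the sketch guards every character pair, not just the "certain characters" that actually triggered the fault.

```python
import random

LOW_MASK = 0x0F               # assumption: "lower order bits" = low 4 bits
ESCAPES = (0xF0, 0xF1, 0xF2)  # hypothetical escape bytes, distinct low nibbles
FLIP = 0x01                   # hypothetical transform applied to an escaped byte


def is_bad_pair(first: int, second: int) -> bool:
    """True if the low bits of `second` are the complement of those of `first`."""
    return ((first ^ second) & LOW_MASK) == LOW_MASK


def wire_is_safe(wire: bytes) -> bool:
    """True if no two adjacent bytes on the wire form the troublesome pattern."""
    return all(not is_bad_pair(p, q) for p, q in zip(wire, wire[1:]))


def encode(data: bytes) -> bytes:
    """Sender side: insert escape sequences so the pattern never hits the line."""
    out = bytearray()
    for b in data:
        prev = out[-1] if out else None
        if b in ESCAPES or (prev is not None and is_bad_pair(prev, b)):
            x = b ^ FLIP
            # Each neighbour rules out at most one escape byte, so one of the
            # three candidates is always usable.
            for esc in ESCAPES:
                ok_before = prev is None or not is_bad_pair(prev, esc)
                if ok_before and not is_bad_pair(esc, x):
                    out += bytes([esc, x])
                    break
            else:
                raise AssertionError("no usable escape byte")
        else:
            out.append(b)
    return bytes(out)


def decode(wire: bytes) -> bytes:
    """Receiver side: undo the escape sequences."""
    out = bytearray()
    it = iter(wire)
    for b in it:
        if b in ESCAPES:
            out.append(next(it) ^ FLIP)  # the byte after an escape is the payload
        else:
            out.append(b)
    return bytes(out)


if __name__ == "__main__":
    rng = random.Random(1972)
    sample = bytes(rng.randrange(256) for _ in range(100_000))
    wire = encode(sample)
    print(wire_is_safe(wire), decode(wire) == sample)
```

The point of the design is that the sender, which knows the failure pattern, never lets it reach the line card, while the receiver needs only a trivial amount of code to undo the escapes. Bulk random traffic, as in the last few lines, plays the role of the soak test described above.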
Based on this I went to the Honeywell Field Engineers. (I will note that we simply assumed that the problem that was detected on the PDP-8 was the same problem that I was having with GCOS -- but there was no alternative.) Because this involved only communications lines, the FEs could work on the problem during normal service. I set up a program that would repeatedly send a small frame containing one of the error-producing sequences, and the FEs got out their logic probes and oscilloscopes. And, after a few days, the answer came back: yes, there was a hardware problem, but no, there didn't seem to be any way to fix it!
Well, this was a new and unexpected development -- and a major problem. First, I don't think that there were any other synchronous line cards for the GIOC anywhere in the world, so we couldn't just try another card. So, maybe we could set up an escape sequence with the GCOS system in the same way that we had with the PDP-8. Nope, I was informed that there was no way to make any changes to the GCOS system.
And with that, my GRTS emulator project was dead in the water. All tapes going to Phoenix would go by air.
Luckily, the processor bring-up process encountered relatively few problems, and the hardcore programmers were able to key in patches for many of the small problems that they encountered. And, in a few months, the 6180 processor was ready to ship to Cambridge.
To this day, I still don't know why we couldn't have gotten a program written to accept the escaped data and write the MST tape. On Multics it would only have been about a couple of hundred lines of code. Grumble, grumble.
Still, it was a very abrupt and disappointing end of a project for me.
written 29 Feb 2016
New Intro and other mods as suggested by THVV. (2016-03-06)