Hellandizing

I had the privilege of working with Pat Helland at Tandem Computers in the 1980s. His intelligence, creativity, good humor, and ebullience were a rare gift to his colleagues. (Pat had done great work at BTI Computer Systems, now largely forgotten.)

In 1984, Pat invented a technique to help test server programs. We called it "Hellandizing" the program under test. It involves placing test points at all significant locations in the program. When the program reaches a test point, it consults a master debug-mode switch to see whether testing is enabled, and simply continues if it is not. If testing is turned on, the test point checks a control vector for instructions. Each test point has an entry in the vector, which instructs the program to perform one of a set of actions:

Continue
Abort the program
Reset the test point and abort the program
Crash the CPU
Reset the test point and crash the CPU
Sleep for a short time
Wait for a signal
Invoke the debugger
Enter a trace message in the debug log
Count occurrences of the test point

and so on. An interface program allows the programmer to examine and change the control vector from outside the program, by means that depend on the program's environment; for processes operating in privileged mode, low-level message facilities were used.
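The mechanism described above can be sketched in a few lines of Python. This is only an illustration, not the original Tandem code: the names ControlVector, Action, SimulatedAbort, set_action, and hit are all invented here, an exception stands in for aborting the process, and the actions that need a real CPU, debugger, or signal facility are left out.

```python
import enum
import time


class Action(enum.Enum):
    """A subset of the actions a test point can be told to perform."""
    CONTINUE = "continue"
    ABORT = "abort"
    RESET_AND_ABORT = "reset_and_abort"
    SLEEP = "sleep"
    TRACE = "trace"
    COUNT = "count"


class SimulatedAbort(Exception):
    """Stands in for aborting the program (an assumption of this sketch)."""


class ControlVector:
    def __init__(self):
        self.enabled = False   # master debug-mode switch
        self.vector = {}       # one entry per test point: name -> Action
        self.log = []          # debug log written by TRACE
        self.counts = {}       # hit counters maintained by COUNT

    def set_action(self, name, action):
        """Interface the tester uses to change one entry from outside."""
        self.vector[name] = action

    def hit(self, name):
        """Called by the program each time it reaches a test point."""
        if not self.enabled:
            return  # testing is off: just continue
        action = self.vector.get(name, Action.CONTINUE)
        if action is Action.RESET_AND_ABORT:
            self.vector[name] = Action.CONTINUE  # fire once, then disarm
            raise SimulatedAbort(name)
        if action is Action.ABORT:
            raise SimulatedAbort(name)
        if action is Action.SLEEP:
            time.sleep(0.01)
        elif action is Action.TRACE:
            self.log.append(name)
        elif action is Action.COUNT:
            self.counts[name] = self.counts.get(name, 0) + 1
```

The "reset" variants matter when the program is restarted or taken over with the same vector still in force: without the reset, the test point would fire again and abort the replacement too.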

The particular program Pat was testing was a fault-tolerant server consisting of two processes running on different CPUs. As the primary process executed, it sent checkpoint messages to the backup process, so that the backup process could take over if the primary process or its CPU died. (Tandem called these "NonStop" programs.) Checkpoint calls were placed just before every call by which a process affected the rest of the world, whether by file I/O or by sending a message to another process.

Pat placed test points at each checkpoint. For a program that doesn't have a backup, the principle is the same: place a test point just before every operation that reveals the internal state of the program.
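One way to apply this placement rule in a modern language is to wrap each externally visible operation so that the test point is always reached first. The sketch below is an assumption about how one might do it in Python, not a description of the original code; with_test_point, hit, reached, and send are hypothetical names, and hit merely records the name where a real test point would consult the control vector.

```python
import functools

reached = []  # hypothetical record of the test points reached, in order


def hit(name):
    """Placeholder test point: a real one would consult the control vector."""
    reached.append(name)


def with_test_point(name):
    """Decorator that places a test point just before the wrapped operation."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            hit(name)                    # test point comes first
            return fn(*args, **kwargs)   # then the externally visible effect
        return wrapper
    return decorate


@with_test_point("before_send")
def send(outbox, message):
    """Toy externally visible operation: deliver a message."""
    outbox.append(message)


outbox = []
send(outbox, "hello")
```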

Hellandizing makes disciplined testing possible for server programs even though their state is concealed, their logic complex, and their timing dependent on interaction with other processes. The test points identify the places where the program interacts with the outside world and give the tester a handle on each one. We used to test NonStop programs by "walking" a breakpoint down the code of a program 100 locations at a time and forcing a takeover at each breakpoint. But why 100? The interval was arbitrary. Forcing a takeover at each test point instead ensures that every distinguishable state of the process has been tried. To test timing dependencies, we can set the control vector to stop or delay the program at each critical point.
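A toy version of "force a takeover at each test point" might look like the following. Everything here is a simplifying assumption: the primary and backup are plain function calls rather than processes on separate CPUs, the checkpoint is a dictionary, and run_primary, run_with_takeover, and SimulatedAbort are names made up for the sketch.

```python
class SimulatedAbort(Exception):
    """Stands in for the death of the primary process."""


def run_primary(items, checkpoint, output, abort_at=None):
    """Toy primary: checkpoint, pass the test point, then write one item."""
    for i in range(checkpoint["next"], len(items)):
        checkpoint["next"] = i       # checkpoint sent to the backup
        if abort_at == i:            # test point sits at the checkpoint
            raise SimulatedAbort(i)
        output.append(items[i])      # externally visible operation


def run_with_takeover(items, abort_at):
    """Kill the primary at one test point and let the backup finish."""
    checkpoint = {"next": 0}         # state held on behalf of the backup
    output = []
    try:
        run_primary(items, checkpoint, output, abort_at)
    except SimulatedAbort:
        # The backup resumes from the last checkpoint, test point disarmed.
        run_primary(items, checkpoint, output, abort_at=None)
    return output


items = ["a", "b", "c"]
# One run per test point, instead of one run per 100 code locations.
results = {point: run_with_takeover(items, point) for point in range(len(items))}
all_recovered = all(out == items for out in results.values())
```

The driver loop is the point of the sketch: the set of takeover locations is defined by the program's own checkpoints rather than by an arbitrary stride through the code.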

In a system of interacting processes, the ability to have a process under test wait for a signal from a utility program invoked from a test driver means that every critical race can be opened up, and all important sequences of events can be reproducibly tried. The test script can simulate intervening events or cause external failures before allowing the program under test to continue. We can use this approach to answer questions like "what if the communications line failed before we reached this point in the code?"
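The "wait for a signal" action can be illustrated with two threads standing in for the process under test and the test driver. This is a sketch under stated assumptions: WaitPoint, worker, and run_race_scenario are invented names, threading.Event stands in for whatever signalling facility the real environment offers, and the account update is a made-up example of a read-modify-write race.

```python
import threading


class WaitPoint:
    """Hypothetical test point that parks the caller until the driver signals."""

    def __init__(self):
        self.reached = threading.Event()  # set when the program arrives
        self.resume = threading.Event()   # set by the driver to let it go on

    def hit(self):
        self.reached.set()
        self.resume.wait(timeout=5)       # time out rather than hang forever


def worker(account, point):
    """Program under test: a read-modify-write with a test point in between."""
    balance = account["balance"]          # read
    point.hit()                           # race window held open here
    account["balance"] = balance + 10     # write based on the earlier read


def run_race_scenario():
    account = {"balance": 100}
    point = WaitPoint()
    thread = threading.Thread(target=worker, args=(account, point))
    thread.start()
    point.reached.wait(timeout=5)         # program is now parked at the point
    account["balance"] -= 30              # driver simulates an intervening event
    point.resume.set()                    # let the program continue
    thread.join(timeout=5)
    return account["balance"]


final_balance = run_race_scenario()
print(final_balance)  # 110: the intervening withdrawal was overwritten
```

Because the driver decides exactly when the intervening event happens, the lost update shows up on every run instead of once in a great while.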

Test points make it possible to design systematic test libraries that explore every combination of object state and operation, provided that the test designers can gain a deep understanding of the code. Thorough analysis of the program being tested is still necessary in order to enumerate the cases that must be tested, and to understand which cases each test covers.
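The bookkeeping side of this can be sketched as a simple cross product. The states, operations, and test names below are invented placeholders; in practice they would come out of the analysis of the program, which is the hard part and which no script can do for you.

```python
import itertools

STATES = ["closed", "open", "locked"]    # hypothetical object states
OPERATIONS = ["open", "read", "close"]   # hypothetical operations


def all_cases():
    """Every (state, operation) combination that should be tried."""
    return set(itertools.product(STATES, OPERATIONS))


def uncovered(tests):
    """Return the cases that no test covers.

    tests maps a test name to the set of (state, operation) cases it covers.
    """
    covered = set().union(*tests.values()) if tests else set()
    return sorted(all_cases() - covered)


tests = {
    "open_then_read": {("closed", "open"), ("open", "read")},
    "close_open_file": {("open", "close")},
}
missing = uncovered(tests)
```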