Multics Record Stocks and the Physical Volume Scavenger

Olin Sibert, John Bongiovanni

[Olin Sibert] Record stocks is a Multics feature introduced in 1982 to improve the way the Multics file system allocates disk records. The scavenger is a system repair facility that runs after a crash to correct volume accounting. The effect of these two features is to make the system available sooner after a system crash. Both features were designed and implemented by John Bongiovanni. I recall that there were essentially no problems when it went live.

Record (and VTOCE) stocks happened close to the end of traditional file system design approaches, where the file system's recovery mechanisms were built with an intimate understanding of file system's own metadata structure, and that metadata was organized and stored in ways designed to facilitate reliability and recovery. This approach (seen in Multics VTOCEs, Unix i-nodes, and OS/360 VTOCs among other places) has been largely supplanted by newer log-structured file systems that perform recovery by replaying logged transactions without caring too much about the exact content of those transactions. I've never delved into the proofs that log-structured file systems are safe against all possible interruptions in the same sense that we could make rigorous arguments about the Multics file system, and I'm also curious about the relative I/O costs for both normal operation and recovery, but log-structured certainly seems to have taken over from traditional approaches.

Motivation

[Olin Sibert] The record stock / scavenger implementation was motivated partly because it was a neat idea--it's based on a concept that we got from the Amber operating system (of Broughton, Charles Frankston, Daniel Weinreb, et al.) implemented for the Livermore S-1 computer system.

However, it also solved a real problem: the lengthy recovery time for non-ESD crashes that was caused by the need to run a several-minute volume salvaging process on each physical disk. There had been some improvements in running multiple volume salvagers simultaneously (as Salvager.SysDaemon processes, after the root logical volume (RLV) had been fully (and sequentially) salvaged, but that only reduced a several-hour process to an hour or so. At the time record stocks came around, I believe we were contemplating a special-purpose ring zero mechanism (i.e., kludge) that would run multiple salvagers for the RLV to speed that up--in fact, I think it may have been my assignment to make it happen.

Fortunately, record stocks eliminated the recovery time entirely: reboot after a non-ESD crash became just as quick as reboot after a normal shutdown.

Along with the record stock mechanism, a similar mechanism of VTOCE stocks was implemented to manage the allocation of VTOC entries (VTOCEs) and to ensure fast recovery for them as well. It was substantially simpler than the record stock mechanism because VTOCE allocation happens in a much more comfortable environment--unlike allocating pages which has to happen at page fault time. See Ed Sharpe, MDD-007 VTOCE File System.

Storage System Background

[Olin Sibert] For any given physical volume, the information about which disk records are available is kept in the "volume map" ("volmap"), which is a bitmap of allocated records, indexed by record number. I don't remember whether a 1-bit means "allocated" or "free", but it doesn't really matter. I think the bits were stored 32/word, with the upper bits ignored, to avoid use of the EIS bitstring instructions (or divide) to find the right bit.

Allocating a record is simply a matter of setting the right bit in the volume map. As long as that volume map page gets written back to disk, everything is OK. However, it's impractical to write it back after every allocation, because of the impact of all those writes on disk throughput, so the volmap pages hang around in memory like other pages until they get "purified" and written back by the paging system--which could be either before or after the page's data gets written.

The allocation process happens during page fault handling: when a page is referenced that has no backing data (and various access and other conditions permit), a record is allocated to hold that page's eventual content.

Information about page allocations is also held in the volume table of contents (VTOC) entries (VTOCEs), but I/O for those is managed explicitly: when a VTOCE is updated, it gets written, and that is guaranteed to happen before any data is written to the on-disk locations to which it refers. This synchronization between VTOCE I/O and page I/O is the fundamental principle of the New Storage System (NSS), and is a given for the rest of this discussion.

Because the volmap pages can become inconsistent with the actual state of the disk (as defined by the page indexes in the VTOCEs), a physical volume salvager was needed to reconstruct the bitmap from the VTOCEs. This process took a few minutes per disk, which gets to be quite a burden for a big system.

Record Stock Concepts

[Olin Sibert] With that background, the idea of record stocks is straightforward: an additional mechanism and synchronization is added to NSS that ensures that the pages marked as allocated in the volmap are always a superset of the pages claimed by all the VTOCEs.

The implementation concept is equally simple: rather than being allocated one at a time from the volmap (with deferred I/O to update the volmap pages), records are withdrawn in bulk, the volmap is forced to disk, and the withdrawn records are placed in an in-memory "stock" list for use by the page fault handler. When a record is freed (e.g., when a segment is deleted), it is placed back in the volume's stock, and can be reused immediately--without any further reference to the volmap.

There is a hardcore process (I think; this control flow might have been embedded directly into page control rather than using a separate process) that monitors the record stocks: when a stock starts being depleted, the process withdraws more records from the volume map (again, forcing the I/Os to complete), and when a stock starts getting full, the process "deposits" some records from the stock back into the volmap.

If the system crashes, some records may be "lost" (withdrawn, but not actually in use), but that's OK--it just means that the volume has a bit less free space than it would otherwise. The opposite situation can't occur (in use but not withdrawn) because of the synchronous volmap I/O.

The regular NSS volume salvager will recover the lost records, but it need not be run immediately after a crash--it can be deferred until some more convenient time. Still, it's an off-line process: the volume must be demounted to do the salvage.

Record Stock Details

[Olin Sibert] As always, the devil is in the details.

If the stock maintainer process can't keep up with other activity, a stock can become completely full (e.g., many segments deleted at once) or completely empty (lots of page allocations). In this case, the stock manager has to force the offending process to wait while volmap I/O takes place to bring the stock back into equilibrium.

The volmap can be several pages, so it's impractical to keep it "wired" in memory: a system with 50 physical volumes might need 150 pages for the maps, which is a non-negligible fraction of total memory. So, the volmap pages are paged, and which means that filling the stock can cause a page fault. This is a special case in page fault handling, but no different from the original NSS case where any record allocation might require a volmap page fault.

The volmaps are (I think) treated as segments with wired active segment table entries (ASTEs) that are not part of the hierarchy. A per-process abs-seg is used to access the volmap for any particular volume.

Because record stocks are themselves not tiny (as much as half a page each?), I believe they are paged as well, so that a page fault may require taking a page fault on the record stock segment(s). This also requires a special case in page fault handling, but it's really no more complicated than page faults on the volmap segments in original NSS.

Also because the stocks are not tiny, I think there are multiple stock segments--space is grabbed in a stock segment when a volume is mounted, and released at dismount. I don't remember whether this is done with a special-purpose mechanism or a plain old area.

The Scavenger

[Olin Sibert] Deferring physical volume salvaging went a long way toward solving the slow crash recovery problem, but it still required running the salvagers eventually. To finish the solution, the physical volume scavenger was implemented.

The scavenger is an on-line process: it runs while the volume is mounted and the system is operating normally. It scans the VTOC of the volume and determines the real state of record allocation, and it has hooks into the VTOC manager, the stock manager, and the page fault handler so that it can keep track of all record withdrawals and deposits that occur while it is running.

Once it determines the true state of the volume (as represented by the VTOC and selectively supplanted by ASTEs for active segments and the volume's record stock), the scavenger builds a new, accurate, volmap and forces it to disk.

The scavenger is complicated primarily because it has to run on-line and with minimal interference to other processes accessing the volume. It has lots of interactions with the page/segment locking hierarchy. It also has to be safe from crashes, even if it is interrupted in the act of writing the updated volmap.

Because the scavenger, like the salvager, is not required to recover from a crash, it is typically run on a fixed schedule: all volumes might be scavenged early Sunday morning whether they need it or not.

The scavenger is, as I recall, a bit slower than the salvager even on an unloaded system, but not much slower--both are completely I/O bound by VTOC I/O. Normally, VTOC I/O is done in 64-word chunks; it's possible that the salvager and/or scavenger use full-page I/O instead--that would be a good deal faster, but I can't remember whether it works that way.

Tuning Parameters and Meters

[John Bongiovanni] Like other Multics subsystems of the era, the record stock and scavenger each had a set of tuning parameters and meters. The idea was that you'd look at the meters and adjust the tuning parameters for any problems you saw. A more modern approach would be adaptive.

So, for example, you could control the threshold in the record stock at which it obtained more free records, and the one at which it returned free records to the VTOC. The stock metered whenever it had to wait, either for free records or to get enough space in the stock to accomodate the free records it was being handed. The scavenger had tuning parameters that would control its speed and impact on the system.

Testing and Deployment

[John Bongiovanni] Given the complexity of the interaction of the scavenger and the stock with the running system, and the number of levels at which they were interacting, testing was a big concern. Test harnesses were developed that tested the low-level modules in isolation (in ring-4). These were followed by crash-recovery tests under high simulated paging and file system loads. We used the salvager to verify what the scavenger had fixed (I don't remember exactly how). Then we did extensive alpha testing on the MIT system. Part of this testing was running the scavenger all the time to determine whether it found problems that weren't real.

As I recall, there were very few problems when it went live.

Source

[THVV] Source code for the scavenger as of MR12.5 is available online.

The program free_store.alm in page control implements the withdraw and deposit entrypoints that obtained and released pages from record stocks. Several other programs in bound_page_control manipulate stocks. The programs stock.alm and volmap.alm are the main ones.

VTOCE stock management code is found in bound_vtoc_man.

The MR12.5 code at MIT dates from 1999, years later than the initial introduction of these features in MR10.0 in 1982. Some of the features described may have been improved or replaced during the 17 years after introduction.

19 May 2004, updated 14 Feb 2013