     This document describes the crash recovery mechanism for the
current  implementation of  the Data  Management System  (DMS) on
Multics.  The following items are discussed:

     o Finding the state of the last invocation of DM

     o Recovery of DM files to a consistent state

     o Deletion of DM tables from previous invocations

     After  a Multics  system crash,  the Data  Management System
(DMS) must  be recovered to make  all protected (or synchronized)
DM files consistent (simply referred to  as DM files for the rest
of this  MTB).  In the current  DMS implementation, this involves
rolling back  any unfinished transactions recorded  in the before
journals (BJ's) open at crash time.  Because DMS recovery is
closely tied to DMS initialization, the reader should have a good
understanding of MTB 592, "Data Management:  System Structure and
Initialization".  The reader should also be familiar with the
MTB's concerning the DM before journal manager, especially
"Phasing page control and before journal".  These are MTB's 513,
559, 560, 563, 564, 567, and 568.  MTB 508, "Data Management:
Architectural Overview", is also useful background.

     One of the major factors in this design of crash recovery is
the use of normal DMS software whenever possible.  Crash recovery
does not have the critical time constraints that a running DMS
does; however, DMS should be available to users as quickly as
possible.  A few minutes between Multics and DMS availability is
not felt to be crucial, but it certainly should not take half an
hour.  It is felt that time is better spent making the normal DMS
software run better and reducing the number of specialized
programs needed (and the associated maintenance cost).


     Crash  recovery is  an integral part  of DMS initialization,
done by a DMS Daemon, after a Multics bootload.  Recovery is done
about half-way through DMS  initialization, after a temporary DMS
has  been  created.   See  MTB  592,  "DMS  System  Structure and
Initialization" for more detail about initialization proper; this
MTB will attempt to stay  within the recovery process except when
initialization must be referenced.

     The  following  is a  basic list  of the  steps done  in DMS
recovery  (by the  program dm_recovery_.pl1).  The  items will be
discussed in more detail in later sections.

    o  Find the  bootload directory for the  DMS to be recovered.
       This step may fail if the DMS is running, or the hierarchy
       leading  to  and/or including  the bootload  directory has
       been lost.

    o  See if the previous DMS  bootload needs to be recovered by
       examining  the  state indicator  in  the old  tables; stop
       recovery if normal shutdown was indicated.

    o  Find the previous bootload's file_manager_ UID to pathname
       table.   This  is used  to  open DM  files that  have been
       modified and must be made consistent.

    o  Open all before journals open at crash time.

    o  Loop  through  the  opened  journals  finding  all  active
       transactions and rolling them back.

    o  Delete  or  rename  the  previous  bootload's  tables  and
       hierarchy and generally cleanup  the DMS system hierarchy.
       This completes the recovery procedure.

     In  the course  of the  above operations,  errors may occur.
These  errors  are  logged  in  a  DMS  system  log  kept  by the
initializer  of Data  Management.  A primitive  handling of these
errors is  done by flags  set by an administrator  of DMS.  These
flags are:  initializing,  always_enable, and rename_old_dms_dir.
If initializing is on, a previous bootload of DMS must not exist,
and  the  other two  flags  are ignored;  basically,  recovery is
useless as  nothing exists to  recover.  Otherwise, the  last two
flags are used.  If recovery takes any error and always_enable is
on, the errors will be reported as normal, but DMS initialization
will continue with the step after recovery.  Regardless of
whether an error occurs, if rename_old_dms_dir is on, the
previous directory containing the DMS tables will be renamed for
later investigation.  This is only recommended for debugging.
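
     The flag handling described above can be summarized in a
short sketch.  This is an illustration only, written in Python
rather than the PL/I of the actual system; the names
RecoveryFlags and plan_after_recovery are hypothetical.  It
assumes that, when always_enable is off, an error in recovery
stops DMS initialization, which the text implies but does not
state outright.

```python
from dataclasses import dataclass


@dataclass
class RecoveryFlags:
    # Flag names follow the text; the class itself is hypothetical.
    initializing: bool = False
    always_enable: bool = False
    rename_old_dms_dir: bool = False


def plan_after_recovery(flags, recovery_failed):
    """Return (continue_initialization, rename_old_directory)."""
    if flags.initializing:
        # No previous bootload should exist, so the other two
        # flags are ignored and there is nothing to rename.
        return (not recovery_failed, False)
    # Assumption: without always_enable, an error halts initialization.
    continue_init = (not recovery_failed) or flags.always_enable
    # The old directory is renamed whether or not an error occurred.
    return (continue_init, flags.rename_old_dms_dir)
```

     For example, with always_enable on and a failed recovery,
the sketch reports that initialization continues.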


     One premise of crash recovery is that a DMS per-bootload
directory exists containing two critical tables:  the DMS file
and before journal managers' UID-pathname tables.  These are
flushed to disk each time ANY modification is made to them, and
so are guaranteed accurate.  If a per-bootload directory does not
exist and the initializing indicator is off, or if initializing
is on and a per-bootload directory DOES exist, recovery takes an
error that is fatal to the current attempt to boot DMS.
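
     The two fatal cases can be stated as a small check.  The
sketch below is in Python for illustration; RecoveryFatalError
and check_per_bootload_dir are hypothetical names, not part of
DMS.

```python
class RecoveryFatalError(Exception):
    """Hypothetical error: fatal to the current attempt to boot DMS."""


def check_per_bootload_dir(old_dir_exists, initializing):
    """Return True if there is a previous bootload to examine."""
    if initializing and old_dir_exists:
        raise RecoveryFatalError(
            "initializing is on, but a per-bootload directory exists")
    if not initializing and not old_dir_exists:
        raise RecoveryFatalError(
            "initializing is off, but no per-bootload directory exists")
    # Either a first-time initialization (nothing to recover) or a
    # normal boot with an old directory to examine.
    return old_dir_exists
```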

     Once the directory is found, an attempt is made to check the
state of the DMS invocation found.  If a normal shutdown is
indicated, nothing more needs to be done and recovery is
finished.  Otherwise, the file manager's UID-pathname table is
located; locating the before journal manager's table is left
until later.

     Three programs are used to check for the above items:
dm_util_$find_old_boot_dir (to find
dm_dir.<Multics_bootload_time>), dm_util_$dm_status (to see if a
normal shutdown occurred), and


     At  this   point,  it  is  important   to  realize  two  DMS
per-bootload    directories    are    in    use    by   recovery:
dms_dir.<Multics_bootload_time>                               and
dms_dir.<Multics_bootload_time>.temp.  The first is the directory
containing the data required for recovery.  The latter is the
active version of DMS in which the DMS Daemon is doing
initialization, and therefore recovery.

     It is possible that no active transactions were left in the
last DMS invocation, but the old transaction tables are not
guaranteed consistent; only the file and before journal managers'
UID-pathname tables are.  The only way to be sure no transactions
were left unfinished is to open all the before journals listed in
the before journal UID-pathname table and read them backwards
looking for active transactions.  The before images and marks in
a before journal are trusted (according to the protocol that DM
file control intervals will not be written to disk until the
matching before image(s) are on disk).

5.1 Open before journals for recovery

     The  procedure  before_journal_manager_$open_all_after_crash
does  this step.   It finds  the old  before journal UID-pathname
table and loops through it opening all journals listed as active.
If  no  journals  are found  in  the  list, nothing  needs  to be
recovered.   Any  journal opened  is recorded  in the  new before
journal         UID-pathname         table         in         the
dm_dir.<Multics_bootload_time>.temp directory.

     In the process of opening the journals, each is positioned
to the last control interval written.  This control interval is
recorded in CI0 of the journal.  The definition of the last
control interval in a journal is that CIn is last if time_stamp
(CIn) > time_stamp (CIn + 1).  (Remember that a journal is
circular:  if CIm is the physically last CI in the journal,
CIm + 1 is CI1 of the journal.)
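
     The time stamp rule can be illustrated with a short sketch
(Python, for illustration only).  The function name find_last_ci
and the list layout are hypothetical.  The list holds the time
stamps of CI1 onward, since CI0 is not part of the circular
sequence.  Returning None when no CI satisfies the rule (for
example, when all time stamps are equal) is an assumption of this
sketch.

```python
def find_last_ci(time_stamps):
    """Return the number n of the last CI written, or None.

    time_stamps[i] holds the time stamp of CI(i + 1); CI0 is
    deliberately excluded from the circular sequence.
    """
    count = len(time_stamps)
    if count == 0:
        raise ValueError("journal has no control intervals")
    if count == 1:
        return 1
    for n in range(1, count + 1):
        # The journal is circular: the successor of the physically
        # last CI is CI1.
        successor = n % count + 1
        if time_stamps[n - 1] > time_stamps[successor - 1]:
            return n
    return None  # assumption: no CI can be singled out as last
```

     With time stamps 40, 50, 10, 20, 30 for CI1 through CI5,
the sketch selects CI2, since 50 > 10.  With 10, 20, 30 it
selects CI3, because the comparison wraps around to CI1.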

     Note that the control interval found as being last in the
journal is not necessarily the last one written on the
operational system we are recovering.  Especially in a no-ESD
crash, a CI could have been written in memory without its
contents reaching disk.  The result is that a transaction could
have been started or completed with no record left for recovery.
However, since the writing of BJ CI's and DM file CI's is phased
so that the BJ CI's always make it to disk first (except for
abort and commit marks, in which case the situation is reversed),
recovery does not care.  Minimal work will be lost in this
situation.  See MTB's 563 and 564 for more detail on this.

5.2 Create a Temporary Transaction Table

     Build a table of all transactions recorded in the BJ's which
have not completed.  Two lists are kept:  one of completed
transactions (which are not rolled back), and one of transactions
still in progress as far as the data recorded in the BJ's
indicates.  A transaction with extra work to do after it is
committed is still considered in progress; it will not be rolled
back, but the post-commit actions will be done.

     Each BJ record has the  number of active transactions in the
BJ  at the  time the  record is  written recorded  in its header.
This  is  used  so  the entire  BJ  does  not have  to  be walked
backwards  to guarantee  all active transactions  are caught.  By
convention (and  common sense), commit  and abort records  do not
count themselves as active transactions when written to a BJ.
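
     A minimal sketch of how the active transaction count can
bound the backward walk follows (Python, for illustration only).
The record layout BJRecord and the function scan_backwards are
hypothetical.  The stopping rule is an assumption of this sketch:
the walk ends once the number of in-progress transactions found
equals the count in the header of the last record written.
Post-commit actions are not modeled.

```python
from typing import NamedTuple


class BJRecord(NamedTuple):
    # Hypothetical layout; only the fields the sketch needs.
    txn_id: int
    kind: str          # "image", "commit", or "abort"
    active_count: int  # active transactions when the record was written


def scan_backwards(records):
    """records are ordered oldest first; return
    (completed, in_progress, records_examined)."""
    completed, in_progress = set(), set()
    if not records:
        return completed, in_progress, 0
    # Commit and abort records do not count themselves, so this is
    # the number of transactions still active at the end of the BJ.
    target = records[-1].active_count
    examined = 0
    for rec in reversed(records):
        if len(in_progress) == target:
            break  # every active transaction has been found
        examined += 1
        if rec.kind in ("commit", "abort"):
            completed.add(rec.txn_id)
        elif rec.txn_id not in completed:
            in_progress.add(rec.txn_id)
    return completed, in_progress, examined


journal = [
    BJRecord(1, "image", 1),
    BJRecord(2, "image", 2),
    BJRecord(1, "commit", 1),
    BJRecord(3, "image", 2),
    BJRecord(3, "image", 2),
]
```

     For the sample journal, transactions 2 and 3 are found in
progress and transaction 1 completed after examining four of the
five records; the oldest record is never read.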

     Note the previous step does  not have to be completed before
this one is  called.  The steps are put in  a loop where only one
BJ is  examined and worked over  at a time.  If  at this point in
recovery, no transactions  were active, we simply go  to the step
to close all BJ's opened.

5.3 Finish the Transactions Found

     If the temporary transaction table  is not empty, invoke the
procedure transaction_manager_$recover_after_crash with a pointer
to the temporary transaction table.  It does the following steps:

     The  DMS state  indicator, dm_system_data_$current_dm_state,
is set to show recovery is in  progress.  This is used by some of
the  DMS Daemon's  transaction adjustment  programs to  know that
some  special  calls need  to  be made.   (This is  actually done
earlier, but only has relevance to recovery now.)


     The temporary transaction table is looped through, building
a valid transaction definition table for the transaction manager.


     Call  before_journal_manager_$rebuild_after_crash  with  the
pointer to the temporary transaction table.


     Now loop  calling tm_adjust_txn.  This is  the normal method
of adjusting a  transaction for a dead process  in an active DMS.
This is done  so the before journals will  be consistent when the
adjustment is finished.  In a sense, the transactions read from
the BJ's have been adopted by the now partially active DMS
(partially, because users still cannot access it).


     All of crash recovery is done except for some
housecleaning.  First, call file_manager_$end_of_crash_recovery
to null out the internal pointer kept for the call to
file_manager_$open_by_uid_after_crash.  This helps prevent
accidental modification of a file through this pointer after
recovery is complete.  Next, close all BJ's opened in the process
of doing the above examinations and rollbacks.  (Note that the DM
files have already been closed.)  This is done to clear the
per-bootload BJ UID to pathname table for a fresh start.

     At this point, recovery is actually done.  However, the old
dm_dir.<Multics_bootload_time> is simply using quota for
(usually) no good reason.  If the rename_old_dms_dir flag is on,
the old directory will be renamed to
dm_dir.<Multics_bootload_time>.hold and will be available for
examination by a suitably privileged user later.  Otherwise, the
directory will be deleted and its quota recovered.  This step
need not be completed before users are allowed into DMS.
Although it is logically part of recovery processing, it will
actually be done as part of DMS initialization.
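
     The rename-or-delete choice can be sketched as follows.
This is an analogy only:  it uses Python and an ordinary file
system rather than the Multics hierarchy, and the names
dispose_old_dir, demo, and dm_dir.old_bootload are hypothetical.

```python
import os
import shutil
import tempfile


def dispose_old_dir(old_dir, rename_old_dms_dir):
    """Rename the old directory to <name>.hold, or delete it."""
    if rename_old_dms_dir:
        held = old_dir + ".hold"
        os.rename(old_dir, held)   # kept for later examination
        return held
    shutil.rmtree(old_dir)         # space (quota) is recovered
    return None


def demo(rename_old_dms_dir):
    """Run the disposal in a scratch directory; list what remains."""
    parent = tempfile.mkdtemp()
    old_dir = os.path.join(parent, "dm_dir.old_bootload")
    os.mkdir(old_dir)
    dispose_old_dir(old_dir, rename_old_dms_dir)
    remaining = sorted(os.listdir(parent))
    shutil.rmtree(parent)          # clean up the scratch area
    return remaining
```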


     DMS crash recovery requires that some conventions be
observed by DMS shutdown.  The major requirement is that the
dms_dir.BOOTLOAD not be deleted.  This gives crash recovery extra
assurance that the system shut down normally (if it did), instead
of leaving it unsure whether the directory was lost in a crash.
If the state indicator in the old dm_system_data_ is set to
normal shutdown, no crash recovery need be done.


     One of the major assumptions of DMS recovery is that the
directory hierarchy containing the critical system-wide DMS data
can be found.  Directories are not DM files, however.  If the DMS
per-system hierarchy is flushed up to the root directory when an
invocation of DMS is made available to users, some possible cases
of loss can be avoided.  The use of DIRW seems extreme for just
this one instance.  If this is not feasible, a problem is still
unlikely to appear in most crashes.