Disaster Recovery Procedures (ATNF - Narrabri)

This document contains the procedures to be followed in the event of a disaster affecting computer or network components crucial to the operation of the telescope (offline and online). Routine administration procedures vital to maintaining a reliable system, such as configuring disks and maintaining well-defined tape backups, are also outlined here.

A copy of this document is to be kept in the system administrator's room in case this information is inaccessible because the web server is unavailable.

A number of possible outcomes are discussed below. In each case, the appropriate staff and affected users are to be notified.
  Screen Room Destroyed

If this were ever to occur, we would surely be up the creek without a paddle. Critical systems in the system administration domain are indicated below:

  Control Room Destroyed
  kaputar

Current Specifications/Configuration

The following manuals are located in the system administrator's office. The documentation provided in these manuals can be most useful during a disaster recovery:
  • Restore the operating system and check the integrity of the filesystems. The steps are briefly outlined below:
  •   kaputar's RAID disks

    Specifications on kaputar's RAID disks can be obtained by running the StorageWorks RAID Array 200 Management Utility v1.1.1 (/usr/bin/swxcrmgr).

    Specifications on RAID disks (on kaputar)

    Channel         Vendor          Model           Rev:        Size (Mb)
    ----------------------------------------------------------------------------
      A-0           DEC             RZ29B           0016            4091
      A-1           DEC             RZ29B           0014            4091
      B-0           DEC             RZ29B           0014            4091
      B-1           DEC             RZ29B           0016            4091
      B-2           DEC             RZ29B           0016            4091
      B-3           DEC             RZ29B           0016            4091
      C-0           Quantum         XP34300W        L915            4101
      C-1           Quantum         XP34300W        L915            4101
      C-2           Quantum         XP34300W        L915            4101
      C-3           Quantum         XP34300W        L915            4101
      D-0           DEC             RZ1DB-CA        LYJ0            8678
      D-1           DEC             RZ1DF-CB        0372            8678
      D-2           DEC             RZ1DB-CA        LYJ0            8678
      D-3           DEC             RZ1DB-CA        LYJ0            8678
    ----------------------------------------------------------------------------


    Logical Drive Table (RAID array)

    -----------------------------------------------------------------------
    Drive           RAID            Size            Status
    Group           Level           (Mb)
    -----------------------------------------------------------------------
      A              1               4091           Optimal
      B              5              12273           Optimal
      C              5              12303           Optimal
      D              5              26034           Optimal
    -----------------------------------------------------------------------
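
The sizes in the Logical Drive Table can be cross-checked against the channel table with a little arithmetic. The sketch below assumes that each drive group is built from the disks on the like-named channel (group A from A-0 and A-1, group B from B-0 to B-3, and so on); the tables suggest this but do not state it. The function name usable_mb is illustrative and is not part of swxcrmgr.

```python
# Usable capacity of a RAID set, computed from its member disk sizes.
# Assumption: drive group X is made of the disks on channel X.

def usable_mb(level, disk_sizes_mb):
    """Return the usable capacity in MB for a RAID 1 or RAID 5 set."""
    smallest = min(disk_sizes_mb)
    count = len(disk_sizes_mb)
    if level == 1:
        # Mirroring: every member holds a full copy of the data.
        return smallest
    if level == 5:
        # One member's worth of space is used for distributed parity.
        return (count - 1) * smallest
    raise ValueError("unsupported RAID level: %r" % level)

# (RAID level, member disk sizes in MB) taken from the tables above.
groups = {
    "A": (1, [4091, 4091]),
    "B": (5, [4091] * 4),
    "C": (5, [4101] * 4),
    "D": (5, [8678] * 4),
}

for name, (level, disks) in groups.items():
    print(name, level, usable_mb(level, disks))
```

Under that assumption the computed values agree with the table: 4091, 12273, 12303 and 26034 MB for groups A to D.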


    RAID Levels and redundancy

    Configuring RAID groups
      ningadhun

    Current Specifications/Configuration

    Determine and identify damaged components. If the server has been completely destroyed (e.g. by fire), go to section 1; otherwise proceed to section 2. Section 3 is a special case, where a failure is anticipated and there is time to move the data to another server.
    1. Establishing a new PC server
    2. ningadhun is partially damaged
    3. Promoting a BDC to a PDC

    1. Establishing a new PC server

    This section is relevant for the scenario where the data on ningadhun is not accessible due to a major hardware failure.

    2. ningadhun is partially damaged

    This section refers to some component of ningadhun (DELL PowerEdge 2300) being damaged to such an extent that normal user services cannot continue, thus requiring urgent attention.

    3. Promoting a BDC to a PDC

    This procedure is relevant if one suspects an imminent failure of ningadhun and there is still time to move the data to another server.


      noel

    A number of outcomes are discussed below.

    1. System disk damaged (VMSSYS0)
    2. Data disk damaged ($DISK3)
    3. Faulty monitor(s)
    4. Does not power up
      leon

    A number of outcomes are discussed below.

    1. System disk damaged (VMSSYS1)
    2. Data disk damaged ($DISK0)
    3. Faulty VT320 terminal
    4. Does not power up
      System Backup

    Unix and WinNT machines

    Contents of the file /etc/fstab.

    The following details are shown from left to right respectively: Name of filesystem, mount point on kaputar, type of filesystem and read/write details. All the filesystems indicated below are backed up (except for swap1, swap2, /syscdrom0, /syscdrom1, /data/KAPUTAR_2 and /data/KAPUTAR_3).

    =====================================================================================
    Filesystem              Mount point             Type    Options Dump Pass
    =====================================================================================
    /dev/re0b               swap1                   ufs     sw 0 2
    /dev/rz1b               swap2                   ufs     sw 0 2
    /dev/re0a               /                       ufs     rw 1 1
    /dev/re0g               /usr                    ufs     rw 1 2
    /dev/re0h               /x                      ufs     rw 1 2
    /proc                   /proc                   procfs  rw 0 0
    /dev/rz4c               /syscdrom0              ufs     ro 0 0
    /dev/rz5c               /syscdrom1              ufs     ro 0 0
    #/dev/fd0c              /fd                     ufs     rw 0 0
    kaputar#opt             /opt                    advfs   rw,userquota,groupquota 0 2
    kaputar#usrlocal        /usr/local              advfs   rw,userquota,groupquota 0 2
    kaputar#atapplic        /atapplic               advfs   rw,userquota,groupquota 0 2
    kaputar#applic          /applic                 advfs   rw,userquota,groupquota 0 2
    kaputar#AIPS            /AIPS                   advfs   rw,userquota,groupquota 0 2
    kaputar#narusers        /narusers               advfs   rw,userquota,groupquota 0 2
    kaputar#source          /source                 advfs   rw,userquota,groupquota 0 2
    kaputar#SOLARIS2local   /export/SOLARIS2local   advfs   rw,userquota,groupquota 0 2
    kaputar#SOLARIS2opt     /export/SOLARIS2opt     advfs   rw,userquota,groupquota 0 2
    kaputar#aips++          /aips++                 advfs   rw,userquota,groupquota 0 2
    kaputar#ATOMS           /ATOMS                  advfs   rw,userquota,groupquota 0 2
    kaputar#www             /www                    advfs   rw,userquota,groupquota 0 2
    data#visitors           /data/KAPUTAR_1         advfs   rw,userquota,groupquota 0 2
    data#students           /data/KAPUTAR_2         advfs   rw,userquota,groupquota 0 2
    data#localdata          /data/KAPUTAR_3         advfs   rw,userquota,groupquota 0 2
    #ningadhun:/T/users     /nt/users               nfs     rw,nfsv2
    #ningadhun:/T/cad       /nt/cad                 nfs     rw,nfsv2
    #ningadhun:/S/pcapps    /nt/apps                nfs     ro,nfsv2
    #%aips2.nrao.edu:/export/aips++/master /aips++_master nfs       rw,grpid,hard,intr,retrans=20,timeo=60
    =====================================================================================
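
The backup rule described above (everything in /etc/fstab except the listed mount points) can be expressed as a small filter. This is only an illustrative sketch, not the site's backup script: the function name backed_up_mounts and the abridged sample are assumptions, and skipping procfs is an extra assumption, since /proc is a virtual filesystem that the text does not mention.

```python
# Sample lines copied from the /etc/fstab listing above (abridged).
FSTAB = """\
/dev/re0b               swap1                   ufs     sw 0 2
/dev/re0a               /                       ufs     rw 1 1
/proc                   /proc                   procfs  rw 0 0
/dev/rz4c               /syscdrom0              ufs     ro 0 0
#/dev/fd0c              /fd                     ufs     rw 0 0
kaputar#opt             /opt                    advfs   rw,userquota,groupquota 0 2
data#visitors           /data/KAPUTAR_1         advfs   rw,userquota,groupquota 0 2
data#students           /data/KAPUTAR_2         advfs   rw,userquota,groupquota 0 2
"""

# Mount points that the text says are NOT backed up.
EXCLUDED = {"swap1", "swap2", "/syscdrom0", "/syscdrom1",
            "/data/KAPUTAR_2", "/data/KAPUTAR_3"}

def backed_up_mounts(fstab_text, excluded=EXCLUDED):
    """Return the mount points that fall under the backup rule."""
    mounts = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and commented-out entries.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Assumption: the virtual procfs filesystem is never backed up.
        if fs_type == "procfs" or mount_point in excluded:
            continue
        mounts.append(mount_point)
    return mounts

print(backed_up_mounts(FSTAB))
```

Note that only a leading "#" marks a comment; AdvFS entries such as kaputar#opt contain a "#" inside the domain#fileset name and must not be skipped.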
      VMS Cluster
    Current status of cluster mounted devices:
    $ show dev/m
    
    Device                  Device           Error    Volume         Free  Trans Mnt
     Name                   Status           Count     Label        Blocks Count Cnt
     $1$DKA0:        (NOEL)  Mounted              0  VMSSYS0         793821     5   4
     $1$DKA100:      (NOEL)  Mounted              0  $DISK3          106104    58   4
     $2$DKA0:        (LEON)  Mounted              0  VMSSYS1         639485   514   4
     $2$DKA100:      (LEON)  Mounted              0  $DISK0          222690    26   4
     $2$DKA400:      (LEON)  Mounted              0  PAGEDISK          9699     2   1
     $12$DKA0:     (DELPHI)  Mounted wrtlck       0  VAXDOCMAR951    369165     1   4
     $12$DKA200:   (DELPHI)  Mounted              0  $DISK1         1267092     7   4
     $12$DKA300:   (DELPHI)  Mounted              0  DELPHI_1037      47295     1   4
     $15$DKA0:      (DESK2)  Mounted              0  KOALA          1344708     1   4
     $15$DKA300:    (DESK2)  Mounted              0  DESK2_1255       41412     1   4
    
     Device                  Device           Error
      Name                   Status           Count
      LTA0:                   Offline mounted      0
      RTA1:                   Mounted              0
      RTA2:                   Mounted              0
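
When comparing a fresh listing against the one above, it can help to pull the free space out of the disk lines. The following is a rough sketch that assumes the column layout shown above; the names DISK_LINE, free_blocks and blocks_to_mb are illustrative. It relies on an OpenVMS disk block being 512 bytes.

```python
import re

# Sample lines copied from the "show dev/m" listing above (abridged).
SHOW_DEV = """\
 $1$DKA0:        (NOEL)  Mounted              0  VMSSYS0         793821     5   4
 $2$DKA400:      (LEON)  Mounted              0  PAGEDISK          9699     2   1
 $12$DKA0:     (DELPHI)  Mounted wrtlck       0  VAXDOCMAR951    369165     1   4
"""

# Device, (node), status with optional qualifier, error count,
# volume label, free blocks, transaction count, mount count.
DISK_LINE = re.compile(
    r"^\s*(?P<device>\$\d+\$\w+:)\s+\((?P<node>\w+)\)\s+"
    r"Mounted(?:\s+[a-z]+)?\s+(?P<errors>\d+)\s+"
    r"(?P<label>\S+)\s+(?P<free>\d+)\s+\d+\s+\d+\s*$"
)

BLOCK_BYTES = 512  # size of an OpenVMS disk block

def free_blocks(listing):
    """Map volume label to free blocks for each mounted disk line."""
    result = {}
    for line in listing.splitlines():
        match = DISK_LINE.match(line)
        if match:
            result[match.group("label")] = int(match.group("free"))
    return result

def blocks_to_mb(blocks):
    """Convert a block count to megabytes."""
    return blocks * BLOCK_BYTES / (1024 * 1024)

for label, blocks in free_blocks(SHOW_DEV).items():
    print("%-14s %8.1f MB free" % (label, blocks_to_mb(blocks)))
```

Lines that do not match the disk layout, such as the headers and the LTA0:/RTA entries, are simply ignored.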

    Shutting down the VMS Cluster

    The VMS cluster needs to be shut down in the following order.

    To re-establish the cluster, boot the machines in the reverse order.

      Fileservers and critical workstations
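    -----------------------------------------------------------------------------------------------------------------------------
    Host name  Description            Operating System       Location         Purpose                               Serial Number
    -----------------------------------------------------------------------------------------------------------------------------
    kaputar    DEC Alpha 1000A 5/400  Digital Unix v 4.0B    Computer Room    UNIX, Web, Mail server, etc.          AY45700625
    ningadhun  DELL PowerEdge 2300    WinNT Server v4.0 SP5  Computer Room    PC server - user data/applications    TK2Q7
    leon       MicroVax 3100-80       Open VMS v 6.1         Computer Room    VAX server                            KA224R6060
    noel       Vaxstation 4000-60     Open VMS v 6.1         Control Room     Compact Array control computer        AB22202U7T
    atria      Digital PC             Red Hat Linux v5.2     Correlator Room  Correlator data acquisition computer  -
    -----------------------------------------------------------------------------------------------------------------------------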
      SUN workstations

    All the currently used SUN workstations are listed below. In the case of hardware problems, one may place a service call with the SUN Service Centre, quoting the serial number and describing the problem to the customer service representative. The serial numbers of the last two workstations are not indicated since they are no longer covered under service warranty.
     
    ------------------------------------------------------------------------------------------------------------
    Host name  Description     Op. System     Location                 Purpose                     Serial Number
    ------------------------------------------------------------------------------------------------------------
    achilles   SUN Ultra 10    Solaris 2.5.6  Computer Room            Solaris Server/Workstation  HW82004508
    medea      SUN Ultra 10    Solaris 2.5.6  Computer Room            SUN workstation             HW82004506
    ambrosia   SUN Ultra 10    Solaris 2.5.6  Computer Room            SUN workstation             HW82004503
    orpheus    SUN Ultra 10    Solaris 2.5.6  Control Room             Online imaging              FW84750472
    poseidon   SUN Ultra 10    Solaris 2.5.6  Mark Wieringa's office   SUN workstation             FW84750488
    argos      SUN Ultra 10    Solaris 2.5.6  Dave McConnell's office  SUN workstation             FW84750478
    molen      SparcStation 5  Solaris 2.5.6  Observer's Area          SUN workstation             425F5931
    corvus     Sparc ULTRA 10  Solaris 2.5.6  Observer's Area          SUN workstation             FW93230158
    vladimir   SparcStation 5  SUN OS v4.1.3  Control Room             SUN workstation             -
    ------------------------------------------------------------------------------------------------------------

      Contact Names
    ---------------------------------------------------------------------------------------
    Company Name                Phone           Fax             Contact Name(s) and details
    ---------------------------------------------------------------------------------------
    Arrow Direct P/L            (03) 9763 8433  (03) 9763 8823  Geoff Bull or John Taylor
    ComNet Solutions            (02) 9899 5700  (02) 9634 1432  Brian Denley
    Connections P/L             (02) 9552 3088  (02) 9552 3258  David Simmons
    COMPAQ Customer Service     1300 368 369    -               -
    COMPAQ Service Calls        1300 788 990    -               Customer ID: 363250
    DELL Sales Representative   1800 803 385    1800 818 341    Abe Khamis or Kevin Keheo
    DELL Credit Manager         (02) 9930 3355  -               Kylee Mace
    DELL Technical Support      1800 808 378    -               -
    DELL Delivery Enquiries     1800 819 339    -               -
    Epson - Technical Support   (02) 9903-9040  -               -
    HP Direct                   131 047         -               -
    Hunter Digital              (02) 4968 4455  -               Marty Wilson
    Software Spectrum           (02) 9418 3811  -               Michael van Zoggel
    Sun Microsystems Aust. P/L  (02) 9466 9466  (02) 9466 9410  Robert Drake
    Sun Service Centre          1800 555 786    -               -
    ---------------------------------------------------------------------------------------

      Service Contract
    Location:          Licenses Box file (System Admin's Office)
    Account Number:    0195340
    Agreement Number:  5908200600B
    Administrator:     Paul Cruz



    Created: John Giovannis (6-Aug-1998)
    Modified: John Giovannis (9-Nov-1999)