Disaster Planning Basics

Even in the Information Age, an ounce of prevention is still worth a pound of cure

By Joe Plasky, Conner Storage Systems / August 31, 1995 (Sept 95)

With a growing number of state and local government agencies dependent upon PC LAN systems for managing day-to-day activities, fault tolerance and disaster prevention planning have emerged as important issues. Government managers recognize that losing important data can devastate an organization, halting the delivery of vital services or impairing its ability to make policy judgments. It is critical that government technology managers understand the issues involved in disaster prevention planning and take steps to create and implement their own plan.

FAULT TOLERANCE

Installing fault-tolerant systems to prevent data loss should be a first step in any government agency disaster prevention plan. For example, RAID (Redundant Arrays of Independent Disks) systems offer data redundancy and non-stop performance by tightly integrating a series of hard disk drives in one array. They provide reliable, cost-effective solutions for protecting real-time data in the event of a server's hard disk failure. With a RAID system, data can be accessed during a disk failure, replacement or repair.
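To make the idea of data redundancy concrete, here is a minimal Python sketch of one common approach, parity striping, in which an extra block is stored so that any single lost block can be recomputed. The article does not say which RAID level a given product uses, so the XOR parity scheme, the three-drive layout and the sample data are all assumptions chosen for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three hypothetical data drives, each holding one block of a stripe.
stripe = [b"GIS-", b"MAP-", b"DATA"]

# The parity block would live on a fourth drive.
parity = xor_blocks(stripe)

# Simulate losing the second drive and rebuilding its contents.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block, which is why the array can keep serving data while a failed drive is replaced.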

RAID systems offer high-capacity data storage, making them particularly well-suited for data-intensive applications, such as GIS, imaging and multimedia. Most RAID systems allow users to increase capacity by either upgrading the disk drives or by adding modules to the RAID system. In selecting a RAID product, look for a system that is Novell-certified for NetWare 3.1x and 4.0x, the major LAN operating systems in state and local government.

RAID systems provide many data storage benefits. For example, RAID systems are designed for reliability. With multiple disk drives, and redundant power supplies and cooling fans, the mean time between data-loss events for a RAID system can be in excess of five million hours. This level of reliability is crucial when an end user's data must be available at all times. Also, a RAID system should provide hot swapping capabilities so that failed drives can be replaced online without system interruption or loss of data. This means that a drive failure would no longer result in the loss of data or even the loss of access to data.
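A rough sense of where figures of this size come from can be had from the textbook approximation for a single-parity array, in which data is lost only if a second drive fails while the first is still being replaced. The sketch below is not the calculation behind the five-million-hour figure quoted above; the formula assumes independent drive failures, and the drive MTBF, drive count and repair time are hypothetical inputs.

```python
def mttdl_single_parity(disk_mtbf_hours, n_disks, repair_hours):
    """Approximate mean time to data loss for one single-parity group.

    Uses the common approximation MTBF^2 / (N * (N - 1) * MTTR), which
    assumes drives fail independently and at a constant rate.
    """
    if n_disks < 2:
        raise ValueError("a parity group needs at least two disks")
    if repair_hours <= 0:
        raise ValueError("repair time must be positive")
    return disk_mtbf_hours ** 2 / (n_disks * (n_disks - 1) * repair_hours)


# Hypothetical inputs: five drives rated at 200,000 hours MTBF each,
# with a failed drive hot-swapped and rebuilt within 24 hours.
estimate_hours = mttdl_single_parity(200_000, 5, 24)
estimate_years = estimate_hours / 8760  # 8,760 hours in a non-leap year
```

The estimate is very sensitive to repair time, which is one reason hot swapping matters: halving the time a failed drive stays in the array doubles the result.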

DISASTERS: LARGE AND SMALL

An effective disaster recovery plan addresses two types of disasters. The first is a local incident that affects the operation of the PC LAN system but still gives administrators the ability to work the problem on-site. The second type of incident is a natural disaster, such as an earthquake or fire. When a natural disaster strikes, an entire region can be affected. This often means the system and its support structure are not accessible for days and data may even be permanently lost.

A basic disaster prevention and recovery plan has four essential parts:

* Risk analysis and site audit
* Hardware protection
* Data protection and recovery
* Contingency plans

RISK ANALYSIS AND SITE AUDIT

Risk analysis is the initial audit that should identify types, locations and amounts of hardware, software and data. Hardware and software audits can be simplified by using software that will automatically inventory your departmental network. You will need this information to plan how much hardware and software will need to be replaced in a major disaster, who needs it, how much it will cost and where it is located.
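As an illustration of how inventory results can feed replacement planning, the following Python sketch totals replacement cost by location. The record fields, asset names and dollar figures are hypothetical; a real inventory tool will export its own format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Asset:
    """One line of a hardware or software inventory (hypothetical fields)."""
    name: str
    location: str
    owner: str
    quantity: int
    unit_cost: float


def replacement_cost_by_location(assets):
    """Sum quantity * unit cost for every asset, grouped by location."""
    totals = defaultdict(float)
    for asset in assets:
        totals[asset.location] += asset.quantity * asset.unit_cost
    return dict(totals)


# Hypothetical inventory for a small agency.
inventory = [
    Asset("File server", "City Hall", "Finance", 1, 12000.0),
    Asset("Workstation", "City Hall", "Finance", 10, 2500.0),
    Asset("Tape drive", "Annex", "Records", 2, 1500.0),
]

cost_by_location = replacement_cost_by_location(inventory)
```

Grouping by the owner field instead of location would answer the "who needs it" question in the same way.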

Data audits are often more difficult. A data audit identifies the most critical computing needs of a government office, who uses the data, and how long the agency can survive without the information and the computers needed to access it.

HARDWARE PROTECTION

Once a risk analysis has been conducted, it is wise to perform a site audit of all LAN-related hardware. With hardware, preventive measures are key to avoiding localized disasters. These include, but are not restricted to, the following.

Security

Have you put measures in place so equipment does not "grow feet"? Be sure to perform regular physical inventories and to keep key pieces of equipment under lock and key.

Preventive Maintenance

Information systems managers should develop a preventive maintenance schedule and stick to it. They should identify key points of failure, including power supplies and hard disks.

Power Sources

Is there an uninterruptible power supply (UPS) at each server? If you work in an area of the country where lightning is a problem, consider installing a UPS at each workstation or, at the very least, a good surge suppressor at each point of entry into the LAN, including telephone lines that connect to modems.

Fire Protection

Halon fire extinguishers are recommended for computer fires because they stop the fire while preserving the computer data. Water and other chemicals will put out the fire but may destroy hardware and data. Keep in mind that many buildings have built-in sprinkler systems that can destroy computer equipment even if the fire is in a completely different part of the building.

DATA PROTECTION AND RECOVERY

Typically, the data stored on a network is more valuable to an organization than the combined value of the hardware, software and LANs that maintain it. This is particularly true in government agencies that have long record retention requirements. Regular backup procedures that copy or archive data to offline storage must be maintained so that data can be restored at a moment's notice. A qualified LAN backup system should be able to meet a user's complete data management needs by backing up data on network servers and local drives, as well as backing up security information.
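The core of any backup procedure is copying the data and then confirming that the copy is good. The Python sketch below shows that copy-and-verify step using a checksum comparison. It is only an illustration of the principle: the directory names are hypothetical, it assumes the backup directory is not inside the source directory, and it does not capture network security information the way a full LAN backup product would.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source_dir, backup_dir):
    """Copy every file under source_dir to backup_dir and verify each copy."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    verified = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        dest = backup_dir / src.relative_to(source_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        if sha256_of(src) != sha256_of(dest):
            raise IOError(f"backup of {src} failed verification")
        verified.append(dest)
    return verified


# Demonstration on a throwaway directory tree with hypothetical names.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "server_volume"
    target = Path(tmp) / "offline_copy"
    (source / "records").mkdir(parents=True)
    (source / "records" / "permits.txt").write_text("permit data")
    copied = backup_and_verify(source, target)
    copied_names = [p.relative_to(target).as_posix() for p in copied]
```

A backup that has never been verified or test-restored offers little assurance, so the verification step is as important as the copy itself.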

After you complete the risk analysis, the site audit and the plans to protect your hardware and data, develop a recovery plan to follow if a disaster occurs. The recovery plan outlines how the organization will restore systems and data as quickly as possible and, until then, present the appearance of "government as usual" without computers.

Set priorities for restoration based on the information gathered during the risk analysis. For example, which systems and networks need to be restored first, second, third and so on? Alternate computing locations should be found in case your facility is inaccessible. If there are multiple LANs within an agency, arrangements can be made for temporary use of internal resources. Often, internal training LANs can be used during an emergency.
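One simple way to turn audit results into a restoration order is to rank systems by how long the agency can function without them. The Python sketch below does this, breaking ties in favor of the system with more users. The ranking rule, system names and figures are assumptions for illustration; an agency may weigh legal or public safety factors differently.

```python
from dataclasses import dataclass


@dataclass
class SystemRecord:
    """Audit findings for one system (hypothetical fields)."""
    name: str
    max_downtime_hours: float  # how long the agency can survive without it
    users: int


def restoration_order(systems):
    """Shortest tolerable downtime first; more users first on ties."""
    return sorted(systems, key=lambda s: (s.max_downtime_hours, -s.users))


# Hypothetical audit results.
audit = [
    SystemRecord("Payroll LAN", 72, 15),
    SystemRecord("Emergency dispatch", 1, 8),
    SystemRecord("Permit tracking", 72, 40),
    SystemRecord("Training LAN", 336, 12),
]

priority_list = [system.name for system in restoration_order(audit)]
```

In this example the training LAN ranks last, which is consistent with lending it out as a temporary resource during an emergency.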

CONTINGENCY PLAN

Draft a contingency plan that includes provisions for your government agency to run manually for a few days. Having this plan in hand can be critical because, in times of disaster, businesses and citizens often turn to government agencies for quick assistance. The plan should detail the electronic workflows, within agencies as well as among multiple agencies, that will need to be carried out by "sneakernet" to keep operations running smoothly. The strategic advantage of maintaining work processes by having access to paper information can be enormous during regional disasters.

After the plan has been finalized, test it. Testing ensures that all areas have been covered. As the LAN systems change, plans need to be reviewed and updated on a regular basis.

There is an old saying, "People don't plan to fail, they just fail to plan." The government agency with a disaster prevention and recovery plan in place will be far better equipped to survive any computer disaster it may encounter.