Tuesday, November 3, 2009

Overview of High Availability

Why High Availability?

Mission-critical computer systems need to be available 24 hours a day, 7 days a week, 365 days a year.

When these systems and their applications are unavailable because of technical problems, we suffer lost productivity at work and inconvenience in our personal lives. It is not difficult to calculate the real costs incurred when critical business systems are down.
More serious consequences occur when critical systems such as traffic control, medical life support, or health services systems are not functioning.

When an application becomes unavailable, the work that it was doing simply stops.
At best, such an outage results only in lost productivity: the application will be up and running again some time later, and the work will be completed then. At worst, an outage can lead to safety hazards, legal action, fines, or negative publicity.
The impact of downtime will vary from business to business and within a business from application to application.

Availability, High Availability, and Fault Tolerance: What do these terms mean?

Availability is the percentage of time that a system operates during its intended duty cycle. For example, if a given system is expected to be functional for 8 hours per day, then availability is measured as a percentage of those eight hours. If a system is non-functional outside this period, it is not counted against the “Availability Metric.”
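The definition above can be written as a short calculation. This is only a minimal sketch: the function name and its parameters are made up for illustration, and it assumes that outage time has already been limited to the hours that fall inside the duty cycle.

```python
def availability_percent(duty_hours: float, outage_hours_in_duty: float) -> float:
    """Return availability as a percentage of the intended duty cycle.

    duty_hours: hours the system is expected to be functional.
    outage_hours_in_duty: hours of outage that fell inside the duty cycle.
    Outages outside the duty cycle are not passed in, so they do not count.
    """
    if duty_hours <= 0:
        raise ValueError("duty_hours must be positive")
    if not 0 <= outage_hours_in_duty <= duty_hours:
        raise ValueError("outage must be between 0 and duty_hours")
    return 100.0 * (duty_hours - outage_hours_in_duty) / duty_hours


# An 8-hour duty cycle with a 30-minute outage during those 8 hours.
print(availability_percent(8, 0.5))  # 93.75
```

An outage at night, outside the 8-hour window, would simply not be included in the second argument.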

High Availability
specifies, as a percentage of the intended duty cycle, the amount of time that a system must be functional. For example, if we specify the availability metric as “Five Nines,” it is understood to mean that the system should be functional for 99.999% of the desired duty cycle.
Refer to the following table for examples of various levels of availability and the associated allowable downtime per year, month, and week, assuming a 24-hour-per-day duty cycle.

Availability %             Downtime / year   Downtime / month (30 days)   Downtime / week
99.9% ("three nines")      8.76 hours        43.2 minutes                 10.1 minutes
99.99% ("four nines")      52.6 minutes      4.32 minutes                 1.01 minutes
99.999% ("five nines")     5.26 minutes      25.9 seconds                 6.05 seconds
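The figures in the table follow from multiplying the length of the period by the fraction of time the system is allowed to be down. Here is a small sketch of that arithmetic, assuming a 24-hour duty cycle, a 365-day year, and a 30-day month as in the table. The function name `allowed_downtime` is chosen for illustration only.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted within `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # Fraction of the period during which the system may be down.
    down_fraction = 1 - availability_percent / 100
    return period * down_fraction


periods = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(weeks=1),
}

for level in (99.9, 99.99, 99.999):
    for name, period in periods.items():
        seconds = allowed_downtime(level, period).total_seconds()
        print(f"{level}% per {name}: {seconds:.1f} seconds")
```

The results are printed in seconds; dividing by 60 or 3600 gives the minutes and hours shown in the table.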

There are two main hardware concerns with respect to maintaining a highly available database environment: server high availability and storage availability.

Fault tolerance (Data redundancy in DB)
differs from high availability by providing additional resources that allow an application to continue functioning, without interruption, after a component failure. Many of the high-availability solutions on the market today actually provide fault tolerance for a particular application component. Disk mirroring, where two disk drives hold identical copies of the data, is an example of a fault-tolerant component. If one of the disk drives fails, the other copy of the data is instantly available, so the application can continue execution.
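The disk mirroring idea can be illustrated with a toy model. This is only a sketch: `Disk` and `MirroredVolume` are hypothetical in-memory stand-ins invented for this example, not a real storage API, and real mirroring (for example RAID 1) also has to resynchronize a replaced disk, which is left out here.

```python
class Disk:
    """Hypothetical in-memory stand-in for a physical disk drive."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_id, data):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[block_id] = data

    def read(self, block_id):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[block_id]


class MirroredVolume:
    """Keeps identical copies of every block on two disks."""

    def __init__(self, primary, mirror):
        self.disks = [primary, mirror]

    def write(self, block_id, data):
        healthy = [d for d in self.disks if not d.failed]
        if not healthy:
            raise IOError("all mirrors have failed")
        # Every healthy disk receives the same data.
        for disk in healthy:
            disk.write(block_id, data)

    def read(self, block_id):
        # Serve the read from the first disk that is still working.
        for disk in self.disks:
            if not disk.failed:
                return disk.read(block_id)
        raise IOError("all mirrors have failed")


disk_a, disk_b = Disk("disk_a"), Disk("disk_b")
volume = MirroredVolume(disk_a, disk_b)
volume.write(0, b"payroll record")

disk_a.failed = True   # simulate a drive failure
print(volume.read(0))  # the read is served from disk_b
```

The application only talks to the volume, so the failure of one drive is invisible to it as long as the other copy survives.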

High Availability (HA) solution

A High Availability (HA) solution must address both planned and unplanned causes of downtime to achieve a truly fault-tolerant and resilient IT infrastructure.

Causes of Planned Downtime

Repairs and upgrades that have minimal impact on the business are considered maintenance. For many applications, availability during business hours is required, but some downtime during non-business hours is acceptable.

All systems will require maintenance at some point. If management does not plan for system maintenance, the system will pick the time and duration of an outage! It is up to the system designer to understand the business need and design the system to allow for planned downtime, thereby minimizing the risk of a system failure.

Causes of Unplanned Downtime

 * Hardware failure
   --> Server hardware
   --> Storage hardware
 * Human error
   --> System management tools
   --> Staff training
   --> Process-oriented IT organization
 * Software corruption or bug
 * Viruses
 * Natural disaster
