High availability is a characteristic of a system that describes the length of time for which the system is operational. So when it is said that some service has 99% availability across a year, it means that out of the entire year (24*365=8760 hours, not accounting for leap years), the service is expected to be operational for 8672.4 hours. A highly available system is one that remains operational for at least an agreed minimum duration within the specified length of time (usually a year).
The literal definition of availability is
Ao = up time / total time.
This equation is not practically useful, but if (total time - down time) is substituted for up time then you have
Ao = (total time - down time) / total time.
Determining tolerable down time is practical. From that, the required availability may be easily calculated.
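To make the relationship concrete, here is a minimal Python sketch (the helper names are illustrative, and the 8760-hour non-leap year matches the figure quoted above):

```python
# Minimal sketch: derive availability from tolerable downtime (and vice versa).
# Assumes a non-leap year of 8,760 hours, as in the text above.

HOURS_PER_YEAR = 24 * 365  # 8,760

def availability_from_downtime(downtime_hours: float, total_hours: float = HOURS_PER_YEAR) -> float:
    """Ao = (total time - down time) / total time, returned as a percentage."""
    return (total_hours - downtime_hours) / total_hours * 100

def uptime_from_availability(availability_pct: float, total_hours: float = HOURS_PER_YEAR) -> float:
    """Hours the system is expected to be operational at a given availability."""
    return total_hours * availability_pct / 100

print(uptime_from_availability(99))                # 8672.4 hours, the figure quoted above
print(f"{availability_from_downtime(87.6):.2f}%")  # 99.00% for 87.6 hours of downtime per year
```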
High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period.
There are three principles of high availability engineering. They are:
- Elimination of single points of failure. This means adding redundancy to the system so that failure of a component does not mean failure of the entire system.
- Reliable crossover. In redundant systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
- Detection of failures as they occur. If the two principles above are observed, then a user may never see a failure, but the maintenance activity must. (A minimal sketch of reliable crossover with failure detection follows this list.)
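The sketch below illustrates the second and third principles only in outline; the replica names and the check()/handle() functions are hypothetical stand-ins, not any real health-check API:

```python
# Illustrative sketch: reliable crossover with failure detection between
# redundant replicas. check() and handle() are placeholder functions.
import logging

logging.basicConfig(level=logging.WARNING)

REPLICAS = ["replica-a", "replica-b"]  # redundancy: no single point of failure

def check(replica: str) -> bool:
    """Hypothetical health check; a real system would probe the service."""
    return replica != "replica-a"      # pretend the primary is currently down

def handle(replica: str, request: str) -> str:
    """Hypothetical request handler on the chosen replica."""
    return f"{replica} served {request}"

def serve(request: str) -> str:
    for replica in REPLICAS:           # crossover: try each replica in turn
        if check(replica):
            return handle(replica, request)
        # detection: the user never sees the failure, but maintenance must
        logging.warning("replica %s failed health check", replica)
    raise RuntimeError("all replicas down")  # no redundancy left

print(serve("GET /status"))            # replica-b served GET /status
```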
Modernization has resulted in an increased reliance on these systems. For example, hospitals and data centers require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, or to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is, from the user's point of view, unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
Scheduled and unscheduled downtime
A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as a hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various application, middleware, and operating system failures.
If users can be warned away from scheduled downtimes, then the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled.
Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example system downtime at an office building after everybody has gone home for the night.
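As a rough illustration with hypothetical numbers, excluding scheduled maintenance can noticeably inflate the reported figure:

```python
# Hypothetical numbers: 8 h of unscheduled downtime plus 40 h of scheduled
# maintenance in an 8,760-hour year.
HOURS_PER_YEAR = 8760
unscheduled, scheduled = 8, 40

including = (HOURS_PER_YEAR - unscheduled - scheduled) / HOURS_PER_YEAR * 100
excluding = (HOURS_PER_YEAR - unscheduled) / HOURS_PER_YEAR * 100

print(f"{including:.3f}%")  # 99.452% when scheduled downtime counts
print(f"{excluding:.3f}%")  # 99.909% when it is excluded
```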
Percentage calculation
Availability is usually expressed as a percentage of uptime in a given year. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows the downtime that is allowed for a particular percentage of availability, presuming that the system is required to operate continuously, i.e. the translation from a given availability percentage to the corresponding amount of time a system would be unavailable.
Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
---|---|---|---|---|
90% ("one nine") | 36.5 days | 72 hours | 16.8 hours | 2.4 hours |
95% | 18.25 days | 36 hours | 8.4 hours | 1.2 hours |
97% | 10.96 days | 21.6 hours | 5.04 hours | 43.2 minutes |
98% | 7.30 days | 14.4 hours | 3.36 hours | 28.8 minutes |
99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours | 14.4 minutes |
99.5% | 1.83 days | 3.60 hours | 50.4 minutes | 7.2 minutes |
99.8% | 17.52 hours | 86.23 minutes | 20.16 minutes | 2.88 minutes |
99.9% ("three nines") | 8.76 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes |
99.95% | 4.38 hours | 21.56 minutes | 5.04 minutes | 43.2 seconds |
99.99% ("four nines") | 52.56 minutes | 4.38 minutes | 1.01 minutes | 8.66 seconds |
99.995% | 26.28 minutes | 2.16 minutes | 30.24 seconds | 4.32 seconds |
99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds | 864.3 milliseconds |
99.9999% ("six nines") | 31.5 seconds | 2.59 seconds | 604.8 milliseconds | 86.4 milliseconds |
99.99999% ("seven nines") | 3.15 seconds | 262.97 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
99.999999% ("eight nines") | 315.569 milliseconds | 26.297 milliseconds | 6.048 milliseconds | 0.864 milliseconds |
99.9999999% ("nine nines") | 31.5569 milliseconds | 2.6297 milliseconds | 0.6048 milliseconds | 0.0864 milliseconds |
Uptime and availability can be used synonymously as long as the item being discussed is kept consistent. A system can be up while its services are not available, as in the case of a network outage; conversely, a system can be available to work on while its services are not up from a functional perspective (as opposed to a software service/process perspective). The perspective is important here: the item being discussed may be the server hardware, the server OS, a functional service, a software service/process, and so on. Keep the perspective consistent throughout a discussion, and uptime and availability can be used synonymously.
Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions (blackouts, brownouts or surges) 99.999% of the time would have 5 nines reliability, or class five. In particular, the term is used in connection with mainframes or enterprise computing.
In general, the number of nines is not often used by network engineers when modeling and measuring availability because it is hard to apply in formulae. More often, the unavailability is expressed as a probability (such as 0.00001) or as a downtime per year. Availability specified as a number of nines is often seen in marketing documents.
The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence.
For large numbers of nines, the "unavailability" index (a measure of downtime rather than uptime) is easier to handle. For example, this is why an unavailability metric rather than an availability metric is used for hard disk or data link bit error rates.
A formulation of the class of nines based on a system's unavailability would be
class of nines = floor( -log10( unavailability ) )
where unavailability = 1 - Ao. For example, an availability of 99.999% gives an unavailability of 0.00001, and -log10(0.00001) = 5, i.e. class five ("five nines").
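A small Python sketch of this formula (the epsilon is only a guard against floating-point rounding and is not part of the definition):

```python
import math

# Sketch of the formula above: class of nines = floor(-log10(unavailability)).
# A tiny epsilon guards against floating-point rounding (e.g. 1 - 0.99999
# is not exactly 1e-5 in binary arithmetic).
def class_of_nines(availability_pct: float) -> int:
    unavailability = 1 - availability_pct / 100
    return math.floor(-math.log10(unavailability) + 1e-9)

print(class_of_nines(99.999))   # 5 ("five nines")
print(class_of_nines(99.95))    # 3 (99.95% still rates only class three)
```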