High Availability and Oracle Environments: Optimizing Performance and Resilience

04 June 2011

In the second part of our dossier on high availability and Oracle environments, we looked at the importance of availability and the costs of production downtime.

This is the third part of the dossier, in which we look at the causes of downtime.

Previous article: High Availability and Oracle environments - part 2: Importance of Availability and the Costs of Downtime.

Planned and unplanned downtime

One of the challenges in designing a high-availability solution is to identify and address every possible cause of production downtime, whether planned or unplanned. Planned downtime can be just as disruptive as unplanned downtime, particularly for international companies with users spread all over the world.
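
To illustrate why planned downtime is hard to hide from a worldwide user base, here is a minimal sketch. The sites, their fixed UTC offsets and the business hours are hypothetical values chosen for the example, not figures from this dossier.

```python
# Hypothetical sites and their fixed UTC offsets in hours (illustrative only;
# a real schedule would also have to handle daylight saving time).
SITES = {"Luxembourg": 1, "New York": -5, "Singapore": 8}

BUSINESS_START, BUSINESS_END = 8, 18  # assumed local business hours, 08:00 to 18:00


def sites_affected(window_start_utc, duration_hours):
    """Return the sites whose business hours overlap the maintenance window."""
    affected = []
    for site, offset in SITES.items():
        for h in range(duration_hours):
            local_hour = (window_start_utc + h + offset) % 24
            if BUSINESS_START <= local_hour < BUSINESS_END:
                affected.append(site)
                break
    return affected


# A two-hour window starting at 01:00 UTC is night-time in Luxembourg and
# New York, but falls in the middle of the Singapore morning.
print(sites_affected(1, 2))  # ['Singapore']
```

With these three assumed sites, no two-hour maintenance window avoids business hours everywhere, which is why planned downtime deserves the same attention as unplanned downtime.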

Causes of unplanned downtime
 

  1. Site failure: this is likely to affect all processing in a data center, or a subset of the applications supported in that data center.
    • Site-wide power failure
    • Natural disaster rendering the IT site inoperable
    • Terrorist or malicious attack on applications or site
       
  2. Cluster failure: the entire cluster hosting an Oracle RAC database is unavailable or down
    • The last surviving node in an Oracle RAC cluster shuts down and cannot be restarted
    • Both redundant INTERCONNECT connections are unusable, or the entire cluster is unusable
    • Database corruption so severe that continuity is not possible on the current Oracle server
    • Disk access error
       
  3. Computer failure: when the system running the database becomes unavailable because it is down or inaccessible.
    • Database server hardware failure
    • Operating system failure
    • Oracle instance failure
    • Network interface failure
       
  4. Data storage failure: when the storage elements of all or part of the database are no longer accessible.
    • Disk failure
    • Disk controller failure
    • SAN array failure
       
  5. Data corruption: a corrupted block is one that has been altered in such a way that it is different from what Oracle expects to find.

    Corruption can be physical or logical, and it can be confined to a single block (intra-block) or span several blocks (inter-block).

    A failure due to data corruption occurs when a hardware, software or network component corrupts data as it is read or written. The impact on service levels varies, from a slight impact when only one or a few blocks in the database are corrupted, to a completely blocked database when the corruption is more extensive.

    Here are just a few of the factors that can lead to corruption:
    • Operating system or disk driver fault
    • Faulty bus adapter
    • Disk controller fault
    • Disk volume manager error causing disk read or write error
    • Software fault
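
As an illustration of how a block "different from what Oracle expects to find" can be detected, here is a minimal sketch of checksum verification on read. It is a toy model and not Oracle's actual block format: the block size, the 4-byte CRC32 prefix and the function names are all assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 8192  # assumed block size for this toy model


def write_block(payload: bytes) -> bytes:
    """Store a payload together with a CRC32 checksum (4-byte prefix)."""
    body = payload.ljust(BLOCK_SIZE, b"\x00")
    checksum = zlib.crc32(body).to_bytes(4, "big")
    return checksum + body


def is_corrupt(stored: bytes) -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    checksum, body = stored[:4], stored[4:]
    return zlib.crc32(body).to_bytes(4, "big") != checksum


block = write_block(b"customer=42;balance=100")

damaged = bytearray(block)
damaged[10] ^= 0xFF  # simulate a faulty controller flipping bits in one byte
damaged = bytes(damaged)

print(is_corrupt(block))    # False
print(is_corrupt(damaged))  # True
```

A check of this kind only catches physical corruption. A block whose checksum is valid but whose content is inconsistent (logical corruption) would pass it unnoticed.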
       
  6. Human error: a user has unintentionally modified or deleted data in a database, or someone has made fraudulent changes; depending on the type of error, the consequences can be more or less serious.
    • Deletion of data files belonging to a database
    • Deletion of objects in a database (tables, etc.)
    • Unintentional modification of data
    • Fraudulent data changes
       
  7. Lost or stray writes: these are another form of data corruption, but one that is much more difficult to detect and repair quickly. The main cases are:
    • Lost write: the I/O subsystem acknowledges the write of a block even though the block has not been written to disk; the next read of this block therefore returns an old version of it, causing a cascade of errors in processing and in the database.
    • Stray write: the write is carried out, but at the wrong location; the next read of the intended block therefore returns an older version, causing a cascade of errors in processing and in the database.
    • In an Oracle RAC database, reading a block on one node returns out-of-date data when another node has just written this block to disk (lost write). This can happen when NFS is used without the “noac” option.
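
The lost write case can be sketched with a toy model: a disk that acknowledges a write without persisting it, and an independent record of the last acknowledged version of each block. The `FlakyDisk` class, the version counter and `detect_lost_write` are hypothetical names invented for this illustration; they do not describe how Oracle implements its own protection.

```python
class FlakyDisk:
    """Toy disk that can acknowledge a write without persisting it."""

    def __init__(self):
        self.blocks = {}              # block_id -> (version, data)
        self.drop_next_write = False  # when True, the next write is lost

    def write(self, block_id, version, data):
        if self.drop_next_write:
            self.drop_next_write = False
            return True               # acknowledged, but nothing reached the disk
        self.blocks[block_id] = (version, data)
        return True

    def read(self, block_id):
        return self.blocks[block_id]


def detect_lost_write(disk, expected_versions, block_id):
    """Return True when the block on disk is older than the last acknowledged write."""
    on_disk_version, _ = disk.read(block_id)
    return on_disk_version < expected_versions[block_id]


disk = FlakyDisk()
expected = {}  # independent record of the last acknowledged version per block

disk.write("b1", 1, "balance=100")
expected["b1"] = 1

disk.drop_next_write = True
disk.write("b1", 2, "balance=50")  # lost write: acknowledged, never persisted
expected["b1"] = 2

print(disk.read("b1"))                          # (1, 'balance=100'), the stale version
print(detect_lost_write(disk, expected, "b1"))  # True
```

The sketch shows why this failure is hard to spot: the stale block is perfectly well formed, so only a comparison with a separate record of what should be on disk reveals the problem.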
       
  8. Hang or slowdown: a hang or slowdown occurs when the database or application is unable to process transactions because of resource or lock contention. A perceived hang may also be caused by a lack of system resources.
    • Application or database deadlocks
    • “Out-of-control” processes consuming system resources
    • Massive “storm” of connections or system errors
    • Application peak load situation with lack of system or database resources
    • Lack of space in the archive log destination or in the FRA (Flash Recovery Area)
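
The last cause, a full archive log destination or recovery area, is one of the easiest to anticipate with a simple threshold check. The sketch below assumes that three figures are available in bytes (the space limit, the space used and the space that can be reclaimed), for example as reported by Oracle's `V$RECOVERY_FILE_DEST` view. The function name and the 85% warning threshold are hypothetical choices made for the example.

```python
def recovery_area_status(space_limit, space_used, space_reclaimable, warn_pct=85.0):
    """Classify recovery-area usage; all sizes are in bytes."""
    if space_limit <= 0:
        raise ValueError("space_limit must be positive")
    # Reclaimable space can be freed on demand, so it is not counted as pressure.
    effective_used = space_used - space_reclaimable
    pct = 100.0 * effective_used / space_limit
    if pct >= 100.0:
        return "FULL", pct
    if pct >= warn_pct:
        return "WARNING", pct
    return "OK", pct


GIB = 1024 ** 3
print(recovery_area_status(100 * GIB, 90 * GIB, 2 * GIB))   # ('WARNING', 88.0)
print(recovery_area_status(100 * GIB, 60 * GIB, 10 * GIB))  # ('OK', 50.0)
```

Run periodically, a check like this gives time to free space or extend the destination before the database stops being able to archive its redo logs.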

Causes of planned downtime

  1. System or database software updates: a planned shutdown is either periodic (for maintenance tasks) or occasional (for system, database software or infrastructure upgrades). The duration of the downtime depends on many factors. Here are a few examples:
    • Add or remove a processor from an SMP server
    • Add or remove nodes from a cluster
    • Add or remove disks or SANs
    • Modify configuration settings
    • Update or patch the server or operating system
    • Update or patch Oracle software
    • Update or patch application software
    • Migrate the hardware platform used
    • Move the database
    • Switch from 32-bit to 64-bit
    • Switch to a cluster architecture
    • Migrate to new storage
       
  2. Data modification: changes are made to the logical structure or physical organization of Oracle database objects. These changes are often intended to improve performance or manageability. Here are just a few examples:
    • Modification of table definitions
    • Implementation of table partitioning
    • Creation or rebuilding of indexes
       
  3. Application changes: Application changes can include changes to data and database schema, as well as changes to programs.

Oracle offers various solutions to avoid both planned and unplanned downtime, and to cope with the various possible failures. These solutions will be covered in future articles.
