The Architecture of High Availability

This page describes the underlying architecture of CA XOsoft's automatic failover solution, CA XOsoft™ High Availability (formerly CA XOsoft WANSyncHA).

CA XOsoft High Availability provides monitoring of a production application together with fully automated failover of applications across the WAN to a replica server in the event the master becomes unavailable. The failover may be triggered automatically or with the push of a button. Once the production server has been repaired, CA XOsoft High Availability can also perform an automated failback of the application to the original production server. How CA XOsoft High Availability accomplishes all this is the topic of this page.

The Need for a High Availability Solution

A replication solution is designed to maintain an exact replica of your production server, even as the data on the production server changes. In the event of production server failure, however, there is still significant effort involved in bringing the standby replica server online to replace the lost master.

Indeed, the design of CA XOsoft High Availability was driven by conversations with CA XOsoft customers who were using CA XOsoft Replication (formerly CA XOsoft WANSync) for replication and performing failover and failback manually. This close collaboration resulted in a tool that has set the standard for continuous application availability using an over-the-WAN replication solution. Not only does CA XOsoft High Availability automate the entire process, but it accomplishes this using a single service and GUI to handle every aspect of the process, from replication to failover and failback and even recovery. CA XOsoft was the first vendor to introduce a replication product that fully automates the entire process and remains the only WAN failover vendor to truly offer failover, failback and all related functions in a symmetric process requiring the use of only a single GUI and service.

In contrast to replication, there is no objective and externally determined standard for what an over-the-WAN high availability solution should provide. In approaching the problem of providing the high availability functionality, therefore, the CA XOsoft team developed a small set of core principles that guided the development of the product.

The Operation of Over-the-WAN High Availability

A CA XOsoft High Availability scenario incorporates all the functionality of a CA XOsoft Replication simple replication scenario, but adds three important new elements:

Pre-run verification of production and standby server configurations and environment
Production server and application monitoring
Push-button or auto-triggered fully automated failover and failback.

The first two elements are briefly described here, then the failover process is discussed at greater length.

Pre-Run Verification

Preparing to successfully fail over an application server across a WAN to a secondary standby server begins long before actual failover occurs. There are many things that can go wrong in the process of failing an application server over to a secondary server — there might be a problem with permissions, or with the way the application is configured, or the state of the application on the master or replica might be different from what CA XOsoft Replication is expecting.

Of course, some of these things might change during the period between starting a high availability scenario and a failover. That is why it is extremely important to carry out regular testing, which CA XOsoft facilitates with its CA XOsoft Assured Recovery capability. Nevertheless, in many cases the problems that can cause failover to fail are already in place at the point when the scenario is started.

For this reason, when a high availability scenario is initiated, CA XOsoft Replication begins by performing an extensive list of checks, hundreds of them in fact, to determine whether any of the common issues that have been determined to cause problems during failover can be found. This phrase, common issues that have been determined to cause problems during failover, is not accidental. While many of the checks performed arise naturally from the design of the software, the CA XOsoft support team is always on the lookout for additional issues that should be added to the standard checks.

The checks performed are far too numerous to list here, but they fall broadly into three categories:

Consistency checks: ensure that all is as expected, for example that network resources point to the current production server, that the application is running only there, and that configurations are consistent.
Permission checks: verify that the engine has the authority to perform the actions it will need to perform during failover.
Application-specific tests: verify that configuration settings specific to the application being failed over are properly set, including Exchange, SQL, Oracle, IIS, and file servers.

Integrated testing and validation are critical to ensuring that failover will occur successfully and that the process of failover cannot cause problems in the environment.
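To make the three categories concrete, the following is a minimal sketch of how a pre-run verification pass might be organized. It is illustrative only and does not reflect the actual CA XOsoft engine: the names Check, run_pre_checks and may_start_scenario are hypothetical, and the rule that a check which raises an error counts as failed is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One pre-run check (hypothetical structure, not the product's API)."""
    category: str              # "consistency", "permission", or "application"
    name: str
    run: Callable[[], bool]    # returns True when the check passes


def run_pre_checks(checks: List[Check]) -> List[Check]:
    """Run every check and return the ones that failed."""
    failed = []
    for check in checks:
        try:
            ok = check.run()
        except Exception:
            # Assumption: a check that cannot be evaluated counts as failed.
            ok = False
        if not ok:
            failed.append(check)
    return failed


def may_start_scenario(checks: List[Check]) -> bool:
    """The scenario is allowed to start only if no check failed."""
    return not run_pre_checks(checks)
```

Running all checks rather than stopping at the first failure lets the administrator see every problem in one pass, which fits the goal of catching issues long before an actual failover.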

Production Server and Application Monitoring

The second element that distinguishes CA XOsoft's high availability solution from a replication-only disaster recovery solution is automatic monitoring of the status of the production server and the application running on it. As soon as the scenario has started, the replica server performs a set of Is Alive checks of the master server on a regular (and configurable) basis.

There are three levels of Is Alive checking:

Ping. A ping request is sent to the master server's IP address to verify that the master is accessible and alive.
Application Check. The replica server sends a request to the CA XOsoft engine on the master server to verify that the application is running. These checks involve verifying that the appropriate services are running and, in the case of Exchange or a database application, connecting to the database and verifying that all datasets are in a good state.
Customized Checks. A script may be registered to perform any additional, customized checks so that true application monitoring may be performed even on applications for which CA XOsoft does not already offer a customized solution.

If any of the checks performed returns an error, the entire check is considered to have failed. If the checks then continue to fail throughout a configured timeout period (by default, 5 minutes), the master server is considered to be unavailable. Depending on how the scenario is configured, this will cause CA XOsoft Replication either to send notification of the problem to the administrator or to initiate failover.

To avoid triggering notifications or failover during planned downtime, Is Alive monitoring may be suspended manually from the management GUI.
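The monitoring rule described above (a round fails if any level fails, and the master is declared unavailable only after rounds have failed throughout the timeout period) can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's implementation: IsAliveMonitor, round_passes and the poll method are hypothetical names, each check is modeled as a callable returning True or False, and the caller supplies the current time in seconds.

```python
from typing import Callable, Iterable, Optional

DEFAULT_TIMEOUT_SECONDS = 300  # 5 minutes, the default stated in the text


def round_passes(checks: Iterable[Callable[[], bool]]) -> bool:
    """One Is Alive round passes only if every level (ping, application,
    custom script) passes; any error fails the whole round."""
    for check in checks:
        try:
            if not check():
                return False
        except Exception:
            return False
    return True


class IsAliveMonitor:
    """Tracks how long consecutive rounds have been failing (hypothetical)."""

    def __init__(self, checks, timeout: float = DEFAULT_TIMEOUT_SECONDS):
        self.checks = list(checks)
        self.timeout = timeout
        self.first_failure_at: Optional[float] = None
        self.suspended = False  # set True during planned downtime

    def poll(self, now: float) -> bool:
        """Run one round at time `now` (seconds). Return True once the
        master should be considered unavailable."""
        if self.suspended or round_passes(self.checks):
            self.first_failure_at = None  # any success resets the window
            return False
        if self.first_failure_at is None:
            self.first_failure_at = now
        return now - self.first_failure_at >= self.timeout
```

Passing the time in explicitly, rather than reading a clock inside poll, keeps the sketch easy to exercise without waiting five minutes.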

The Switch-Over Process

There are two distinct cases of failover:

Clean failover

A clean failover is one that is deliberately triggered while both the master and the replica are fully functional. A clean failover might be performed in order to test the system, or in order to use the replica system to continue the application service while some form of maintenance is performed on the master server.

Emergency failover

An emergency failover is one in which the master production server has failed and is unavailable during the failover process. This is, of course, the typical mode in a disaster recovery situation.

The fact that the failover process is different for these two cases is a distinctive feature of the product. True automation requires that actions be tailored to the actual task being undertaken. The reason for this is not simply convenience; it is basic correctness of the procedure. This is the reason, for example, that CA XOsoft High Availability does not "lose" emails during a clean failover. As long as the master server is functioning and available, CA XOsoft High Availability can perform a clean failover by shutting down the application on the master, allowing all changes to be transferred to the replica, then starting up the application on the replica.

In the case of a "hard" failure, when the master server abruptly fails, some data loss is unavoidable, but this loss has nothing to do with the replication/HA system. Such data includes transactions that were not fully completed by the application and must be rolled back, as well as data that was in transit to the master server at the time it failed.

Much of the failover process is, of course, common to the two types. We shall present the basic clean failover workflow first and then point out briefly where the emergency failover differs.

Clean Failover Workflow

Presented below are the failover steps from the standpoint of a forward failover from the production server to a standby server located elsewhere. The failback process is exactly the same, but with the server roles switched.

A clean failover is initiated manually from the management GUI. When the failover button is pressed, a command is sent by the CA XOsoft Replication manager to the master server. From that point, the failover proceeds through the following steps.

1. Master performs synchronization check. If synchronization of the files on the master and replica is in progress, failover is aborted since the backup server is not yet in sync with the production server.
2. Master stops application services. It is important to ensure that data on the master is no longer being updated before a clean failover can be performed. This may involve stopping an application like Exchange or Oracle, or simply removing access, for example, by removing network shares on a file server.
3. Replica signals successful completion of replication updates. Once the replica has applied all received changes, it signals the master server that it is ready for failover to commence.
4. Master releases network resources. On receipt of the signal from the replica, the master server "releases network resources." The specific meaning of this depends on the type of failover occurring. If DNS is being redirected, for example, this is a null operation. If the hostname or IP address is being moved, then this step involves either renaming the master to a temporary name or removing the IP address that is to be transferred. Similar actions are taken in the case that the server is a cluster. This point is the end of the role of the master server in the failover.
5. Replica adds network resources. Adding network resources is, of course, the opposite of releasing them. As with release, the specific action depends on how failover is configured. For example, the A-record of the DNS entry for the master server may be updated to point to the replica, if DNS redirection is being used.
6. Replica optionally starts backward scenario. CA XOsoft High Availability can perform a failover so that no re-synchronization of data is required. This is accomplished through the use of a backward scenario.
7. Replica starts application services. In this final step, the replica server starts the application that is being failed over. As in the step where the master server stopped application services, this may involve starting an application like Exchange, SQL Server, or Oracle, or may simply mean adding network shares.

If an error occurs at any point during failover, the failover is aborted. If the failure occurs at any point before the attempt to start application services on the replica, CA XOsoft Replication will attempt to restore the master server.
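The ordering and abort behavior of the clean failover can be sketched as a simple step runner. This is a sketch under assumptions, not the product's code: clean_failover, FailoverAborted and the callables passed in are hypothetical, and the choice not to call the restore routine when the initial synchronization check aborts (since nothing has been changed yet) is an assumption rather than something the text states.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action that may raise)


class FailoverAborted(Exception):
    """Raised when a clean failover cannot be completed."""


def clean_failover(sync_in_progress: Callable[[], bool],
                   steps_before_start: List[Step],
                   start_replica_app: Callable[[], None],
                   restore_master: Callable[[], None]) -> List[str]:
    """Run the workflow in order and return the names of completed steps."""
    # Synchronization check: abort if master and replica are still syncing.
    if sync_in_progress():
        raise FailoverAborted("synchronization in progress")
    done = ["synchronization check"]

    # Stop services, signal completion, release and add network resources,
    # optionally start the backward scenario.
    for name, action in steps_before_start:
        try:
            action()
        except Exception as exc:
            # Failure before the replica starts the application:
            # attempt to restore the master, then abort.
            restore_master()
            raise FailoverAborted(f"{name} failed") from exc
        done.append(name)

    # Final step: start the application on the replica. A failure here
    # aborts the failover without the restore attempt.
    try:
        start_replica_app()
    except Exception as exc:
        raise FailoverAborted("start application services failed") from exc
    done.append("start application services")
    return done
```

Separating the final step from the others mirrors the distinction the text draws between failures before and during the attempt to start application services on the replica.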

Emergency Failover Workflow

If the master server is non-functional or only partly functional, the procedure above must obviously be modified. The simplistic approach would be to simply begin the procedure at step 5, the point immediately after the master server ceases to play a role. Far better, however, is to attempt to fix the problem and to perform the failover only if the attempt fails. This is the approach that CA XOsoft High Availability takes. The modified workflow for the case that the replica detects a problem with the master is as follows.

Replica attempts to connect to the master. The replica tries to contact the master server in order to fix the problem.
Upon successful connection, attempt to restart application or trigger failover. If the replica succeeds in contacting the master, it first issues a command to the CA XOsoft engine there to try to restart the application. If the restart is unsuccessful, a normal failover is attempted, beginning with the end of step 2 above.
If connection or normal failover is not successful, perform takeover. If the attempts of the previous steps do not succeed, the replica begins normal failover procedures beginning at step 5 above, thus unilaterally performing a takeover of the production application server's role.

Note that this workflow applies after failover has been triggered. How the triggering occurs is configurable by the administrator in one of two ways. Failover may be triggered automatically by CA XOsoft Replication when it detects that the master is unavailable. Alternatively, CA XOsoft Replication may simply alert the administrator to the problem. The administrator can then press the failover button on the management console to initiate takeover by the replica system. In either case, once failover is triggered, it follows the workflow described above.
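The decision logic of the emergency workflow, together with the two trigger modes, can be summarized in a short sketch. All names here (Outcome, TriggerMode, emergency_failover, on_master_unavailable) are hypothetical and chosen for illustration; each callable is assumed to return True on success, and the real engine's interfaces are not described in this page.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    APPLICATION_RESTARTED = "application restarted on master"
    NORMAL_FAILOVER = "normal failover"
    TAKEOVER = "takeover by replica"


class TriggerMode(Enum):
    AUTOMATIC = "automatic"      # failover starts without intervention
    NOTIFY_ONLY = "notify only"  # administrator presses the failover button


def emergency_failover(connect_to_master: Callable[[], bool],
                       restart_application: Callable[[], bool],
                       normal_failover: Callable[[], bool],
                       takeover: Callable[[], None]) -> Outcome:
    """Try the least disruptive fix first; take over only as a last resort."""
    if connect_to_master():
        # Master is reachable: first try to restart the application there.
        if restart_application():
            return Outcome.APPLICATION_RESTARTED
        # Restart failed: attempt a normal failover with the master's help.
        if normal_failover():
            return Outcome.NORMAL_FAILOVER
    # Connection or normal failover failed: replica proceeds unilaterally.
    takeover()
    return Outcome.TAKEOVER


def on_master_unavailable(mode: TriggerMode,
                          start_failover: Callable[[], None],
                          notify_admin: Callable[[], None]) -> None:
    """Apply the administrator's configured trigger mode."""
    if mode is TriggerMode.AUTOMATIC:
        start_failover()
    else:
        notify_admin()
```

In notify-only mode the same emergency_failover routine would run once the administrator presses the failover button, which matches the statement that the workflow is identical however failover is triggered.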

The description here is deceptively simple, of course. Failover of a modern application server can be fairly or extremely complex, depending on the specific application and configuration.
