SAP on AWS DR Test Observations

Mark Ross
6 min read · Mar 27, 2021


(Originally published as a LinkedIn article 5th January 2021)


I recently worked with a customer to upgrade their SAP estate, replace the Oracle database layer with HANA and migrate from on premises to AWS. I thought I’d share some specific learnings from simulating the failure of Availability Zones (AZs) in AWS.

The solution comprised a deployment within a single AWS Region, using multiple Availability Zones. There were 10 SAP applications; the premise was to design highly available, load-balanced application servers spread across the AZs, with RHEL Pacemaker clusters providing the HANA database layer in an active/passive capacity with failover. The exception to this was SAP BW, which was full-sized in a single AZ with a ‘pilot light’ in the secondary AZ, for cost reasons. A multitude of AWS native services were used, including WAF, load balancing, Route 53, FSx (for Windows CIFS shares) and EFS (for Linux NFS shares) for the shared file systems.

The customer required proof of a disaster recovery test prior to the migration to AWS and service commencement. I know that for immutable environments there would be fun and interesting ways to test things out (Chaos Monkey at the ready!); however, SAP is about as far from immutable as I’ve seen. We hit upon proving availability via a combination of simulated failure of individual instances (load balancing continued, clusters failed over and continued operation), which happens regularly anyway as instances are patched, and simulated failure of an entire AZ. What I found really interesting were the results of simulating an entire AZ failure. Many of the results we witnessed were fairly logical in hindsight, but they might not be something you’d expect until you see them in front of you and realise what has happened. Hopefully these observations are useful if you find yourself in a similar scenario.

To simulate the failure of an entire AZ, the following was undertaken:

  • Apply ‘deny all’ NACLs to the subnets in the AZ being simulated as lost (a sketch of how this might be scripted follows)
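For illustration, here’s a minimal sketch of how that NACL swap could be scripted with Python and boto3. It isn’t necessarily how we performed the change; the region, VPC ID and AZ name are placeholders, and it leans on the fact that a newly created custom NACL denies all inbound and outbound traffic by default.

```python
# Sketch only: simulate AZ loss by swapping every subnet in the target AZ
# onto a newly created NACL (custom NACLs deny all traffic by default).
# The region, VPC ID and AZ name below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")   # assumed region
VPC_ID = "vpc-0123456789abcdef0"                     # hypothetical VPC
FAILED_AZ = "eu-west-2a"                             # AZ to 'fail'

# 1. A brand-new custom NACL denies all inbound and outbound traffic.
deny_all_acl = ec2.create_network_acl(VpcId=VPC_ID)["NetworkAcl"]["NetworkAclId"]

# 2. Find every subnet in the AZ being simulated as lost.
subnets = ec2.describe_subnets(
    Filters=[
        {"Name": "vpc-id", "Values": [VPC_ID]},
        {"Name": "availability-zone", "Values": [FAILED_AZ]},
    ]
)["Subnets"]

# 3. Re-point each subnet's NACL association at the deny-all NACL,
#    recording the original NACL so the change can be reverted after the test.
original_acls = {}
for subnet in subnets:
    acls = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet["SubnetId"]]}]
    )["NetworkAcls"]
    for acl in acls:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] == subnet["SubnetId"]:
                original_acls[subnet["SubnetId"]] = acl["NetworkAclId"]
                ec2.replace_network_acl_association(
                    AssociationId=assoc["NetworkAclAssociationId"],
                    NetworkAclId=deny_all_acl,
                )

print(f"Isolated {len(original_acls)} subnets in {FAILED_AZ}")
```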

The following unexpected behaviours were observed:

We lost all connectivity to the region, not just the AZ we simulated as being ‘failed’.

The root (or should that be ‘route’? :-) ) cause of this behaviour is how the Transit Gateway routes traffic from on premises (via a VPN over Direct Connect) into the VPC. Traffic flows of this nature do not enter the VPC via the Transit Gateway attachment within the AZ for which the traffic is destined; instead, the routing is handled by a single Transit Gateway attachment, and it is not possible to set preferred routes to control this behaviour, so traffic could be sent via any Transit Gateway attachment. In summary, this is a limitation of the testing possible: in the event of a true AZ failure the Transit Gateway attachment in that AZ wouldn’t still be live, and traffic would be re-routed elsewhere. To simulate this we temporarily removed the Transit Gateway attachment from the ‘failed’ AZ.
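For illustration, a minimal boto3 sketch of how the attachment’s presence in the ‘failed’ AZ could be withdrawn by removing that AZ’s subnet from the VPC attachment (IDs and region are placeholders; re-adding the subnet with AddSubnetIds reverses the change after the test).

```python
# Sketch only: withdraw the Transit Gateway attachment from the 'failed' AZ
# by removing that AZ's subnet(s) from the TGW VPC attachment.
# The region, VPC ID and AZ name below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")    # assumed region
VPC_ID = "vpc-0123456789abcdef0"                      # hypothetical VPC
FAILED_AZ = "eu-west-2a"

# Find the TGW VPC attachment for this VPC.
attachment = ec2.describe_transit_gateway_vpc_attachments(
    Filters=[{"Name": "vpc-id", "Values": [VPC_ID]}]
)["TransitGatewayVpcAttachments"][0]

# Work out which of the attachment's subnets sit in the 'failed' AZ.
subnet_info = ec2.describe_subnets(SubnetIds=attachment["SubnetIds"])["Subnets"]
failed_az_subnets = [
    s["SubnetId"] for s in subnet_info if s["AvailabilityZone"] == FAILED_AZ
]

# Remove those subnets so traffic can no longer enter via the failed AZ's ENI.
# (After the test, AddSubnetIds puts them back.)
if failed_az_subnets:
    ec2.modify_transit_gateway_vpc_attachment(
        TransitGatewayAttachmentId=attachment["TransitGatewayAttachmentId"],
        RemoveSubnetIds=failed_az_subnets,
    )
    print(f"Removed {failed_az_subnets} from {attachment['TransitGatewayAttachmentId']}")
```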

Split brain in Pacemaker Clusters (impacted SAP HANA availability)

When failure was simulated via the ‘deny all’ NACLs, it was observed that which AZ each Pacemaker cluster considered to be ‘live’ was effectively random. Some clusters considered the correct AZ to be live and shut down their nodes in the ‘failed’ AZ (the desired behaviour), whilst other clusters considered the ‘failed’ AZ to be live and shut down the nodes that were supposed to stay live (causing database downtime).

The root cause of this behaviour is that when the Pacemaker cluster nodes lose network connectivity to each other, each node independently tries to decide which is live, as each one can independently be active or passive across the AZs in use. They use a combination of factors to make this decision, and if a node considers itself to be the ‘live’ node it ‘fences’ the other node by using the AWS API to power off the ‘lost’ node, in our case via a VPC endpoint. Our test conditions created a split brain, and we saw random behaviour as a result.

In a real DR scenario this shouldn’t be an issue, as the nodes in the failed AZ won’t be there; but to avoid split brain in testing we needed to ensure the ‘failed’ nodes couldn’t reach the EC2 VPC endpoint to fence the other nodes. This could be achieved through security groups on the VPC endpoint, or by powering off the instances.
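For illustration, a minimal boto3 sketch of both approaches (the security group ID, CIDRs, region and AZ are placeholders, and the revoke assumes a matching 443 ingress rule exists on the endpoint’s security group).

```python
# Sketch only: two ways to stop the 'failed' nodes fencing the surviving nodes
# via the EC2 interface VPC endpoint. All IDs, CIDRs and names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")     # assumed region

# Option 1: remove the 'failed' AZ's subnet CIDRs from the VPC endpoint's
# security group, so HTTPS calls to the EC2 API from that AZ are dropped.
ENDPOINT_SG = "sg-0123456789abcdef0"                   # hypothetical endpoint SG
FAILED_AZ_CIDRS = ["10.0.10.0/24", "10.0.11.0/24"]     # hypothetical subnets

ec2.revoke_security_group_ingress(
    GroupId=ENDPOINT_SG,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": cidr} for cidr in FAILED_AZ_CIDRS],
        }
    ],
)

# Option 2: power off the instances in the 'failed' AZ before the NACL swap,
# so there is nothing left to issue a fencing call.
FAILED_AZ = "eu-west-2a"
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "availability-zone", "Values": [FAILED_AZ]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
```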

EFS Availability

EFS is a highly available service, and with mount targets in each AZ the service remained available to the systems within the VPC, as long as you use the DNS entry for mounting (Route 53 ensures instances in an AZ resolve to the mount target in their own AZ). However, the AWS recommendation for clients outside the VPC is to mount by IP address, which clearly doesn’t survive an AZ failure simulation, so it’s one to be aware of.
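For illustration, a tiny sketch of how you might confirm the AZ-local resolution from an instance in each AZ (the file system ID and region are hypothetical).

```python
# Sketch only: from inside the VPC, the EFS file system DNS name resolves to
# the mount target IP in the caller's own AZ, which is why DNS-based mounts
# kept working. The file system ID and region below are placeholders.
import socket

EFS_DNS = "fs-0123456789abcdef0.efs.eu-west-2.amazonaws.com"  # hypothetical

# Run from an instance in each AZ: each should see a different, AZ-local IP.
print(socket.gethostbyname(EFS_DNS))
```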

Unable to test FSx due to a limitation in the AWS service

Amazon FSx is an AWS managed service, with availability controlled by deploying it as a single-AZ or multi-AZ solution (the latter being active/passive). When failure of the AZ where FSx was passive was simulated, there were no issues. When failure of the AZ where FSx was active was simulated, systems lost access to the FSx file system.

The root cause of this behaviour is the design of the service: only the primary node is available to serve traffic, and the standby node is inaccessible. When the AZ hosting the primary node was simulated as failed, DNS was still pointing to the primary node, but it was unreachable behind the ‘deny all’ NACL. The FSx file system didn’t fail over to the other AZ, because from the service’s perspective it was operating normally and still available, albeit with nothing connecting to it. It’s not possible to control failover of FSx from a user perspective.

If the entire AZ failed for real (e.g. fire, flood, etc.), the FSx service would have failed over to the standby node. There are edge cases where all AZs could be online but network connectivity could be impaired (e.g. multiple severed network links) in which this behaviour could occur for real; however, due to the link diversity and resilience built into the AWS backbone, this is considered highly unlikely. AWS also advised that altering the throughput capacity forces a failover and failback, so this could be used for some form of test, although it’s probably difficult to align with wider DR testing as it’s likely to complete fairly quickly.
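For illustration, a minimal boto3 sketch of how such a throughput-driven failover test might be triggered (the file system ID and region are placeholders, and changing throughput capacity has the usual FSx update and cost implications, so treat it as a sketch only).

```python
# Sketch only: nudging the throughput capacity of a multi-AZ FSx for Windows
# file system, which (per the AWS guidance we received) triggers a failover
# and failback. File system ID and region below are placeholders.
import boto3

fsx = boto3.client("fsx", region_name="eu-west-2")     # assumed region
FS_ID = "fs-0123456789abcdef0"                         # hypothetical FSx ID

current = fsx.describe_file_systems(FileSystemIds=[FS_ID])["FileSystems"][0]
current_tput = current["WindowsConfiguration"]["ThroughputCapacity"]

# Step the throughput up one notch; stepping it back later reverses the change.
fsx.update_file_system(
    FileSystemId=FS_ID,
    WindowsConfiguration={"ThroughputCapacity": current_tput * 2},
)
```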

We retested with the following steps and got the behaviour we expected, save for not being able to test FSx:

  • Power off EC2 instances in the AZ being simulated as lost
  • Remove the Transit Gateway attachment from the AZ being simulated as lost
  • Deny access to VPC endpoints from the AZ being simulated as lost
  • Apply ‘deny all’ NACLs to the subnets in the AZ being simulated as lost
