Well-Architected approach to CloudEndure Disaster Recovery

If there is an IT disaster, you must get your workloads back up and running quickly to ensure business continuity.

For business-critical applications, the priority is keeping your recovery time and data loss to a minimum, while lowering your overall capital expenses.

AWS CloudEndure Disaster Recovery helps you maintain business as usual during a disaster by enabling rapid failover from any source infrastructure to AWS. By replicating your workloads into a low-cost staging area on AWS, CloudEndure Disaster Recovery reduces compute costs by eliminating the need to pay for duplicate OS and third-party application licenses.

In this post, I walk you through the principal steps of setting up CloudEndure Disaster Recovery for on-premises machines in just few steps.

Cloud Endure overview :

CloudEndure Disaster Recovery is a Software-as-a-Service (SaaS) solution that replicates any workload from any source infrastructure to a low-cost “staging area” in any target infrastructure, where an up-to-date copy of the workloads can be spun up on demand and be fully functioning in minutes.

CloudEndure Disaster Recovery enables organizations to quickly and easily shift their disaster recovery strategy to public clouds, private clouds, or existing VMware-based Datacenter.   CloudEndure Disaster Recovery solution utilizes block level, Continuous Data Replication, which ensures that target machines are spun up in their most up-to-date state during a disaster or drill.

CloudEndure Disaster Recovery supports recovery from all physical, virtual, and hybrid cloud infrastructure into AWS.

Benefits of CloudEndure Disaster Recovery :

  • Average savings of 80% on total cost of ownership (TCO) compared to traditional disaster recovery solutions.
  • Sub-second Recovery Point Objectives (RPOs).
  • Recovery Time Objectives (RTOs) of minutes.
  • Multiple IT resilience options, ensuring a cost-effective strategy.
  • Support of all application types, including databases and other write-intensive workloads.
  • Automated failover to target site during a disaster.
  • Point-in-time recovery, enabling failover to earlier versions of replicated servers.
  • One-click failback, restoring operations to source servers automatically.

Continuous Data Replication :

At the core of the technology is a proprietary Continuous Data Replication engine, which provides real-time, asynchronous, block-level replication.

CloudEndure replication is done at the OS level, enabling support of any type of source infrastructure:

Physical machines, including both on-premises and colocation data centers Virtual machines, including VMware, Microsoft Hyper-V, and others.

Low-Cost “Staging Area” in Target Infrastructure :

Once the agent is installed and activated, agent begins initial replication, reading all of the data on the machines at the block level and replicating it to a low-cost “staging area” in the customer’s account in their preferred target infrastructure. Customers can provide their preferred target infrastructure as well as other replication setting such as subnets, VLANs, Security groups, replication tags. The “staging area” contains cost-effective resources automatically created and managed by CloudEndure to receive the replicated data without incurring any significant costs. These resources include a small number of VMs (each supporting multiple source machines), disks (one target disk for each replicating source disk), and snapshots.

The initial replication can take from several minutes to several days, depending on the amount of data to be replicated and the bandwidth available between the source and target infrastructure. No reboot is required nor is there any system disruption throughout the initial replication.

After the initial replication is complete, the source machines are continuously monitored to ensure constant synchronization, up to the last second. Any changes to source machines are asynchronously replicated in real-time into the “staging area” in the target infrastructure. Continuous Data Replication enables to continue normal IT operations during the entire replication process without performance disruption or data loss.

Automated Failback :

Once a disaster is over, agent provides automated failback to the source infrastructure. Failback technology also utilizes continuous data replication, failback to source machine is rapid and no data is lost during the process. Automated failback supports both incremental and bare metal restores.

How CloudEndure Disaster Recovery Works :

CloudEndure Disaster Recovery is an agent-based solution that continually replicates your source machines into a staging area in your AWS account without impacting performance. It uses Continuous Data Replication technology, which provides continuous, asynchronous, block-level replication of all of your workloads running on supported operating systems. This allows you to achieve sub-second recovery point objectives (RPOs). Replication is performed at the OS level (rather than at the hypervisor or SAN level), enabling support of physical machines, virtual machines, and cloud-based machines. If these is a disaster, you can initiate failover by instructing CloudEndure Disaster Recovery to perform automated machine conversion and orchestration. Your machines are launched on AWS within minutes, complying with aggressive recovery time objectives (RTOs).

Replication has two major stages :

  • Initial Sync: Once installed, the CloudEndure agent begins initial, block-level replication of all of the data on your machines to the staging area in your target AWS Region. The amount of time this requires depends on the volume of data to be replicated and the bandwidth available between your source infrastructure and target AWS Region.
  • Continuous Data Protection: Once the initial sync is complete, CloudEndure Disaster Recovery continuously monitors your source machines to ensure constant synchronization, up to the last second. Any changes you make to your source machines are replicated into the staging area in your target AWS Region.

Architecture of CloudEndure Technology :

Disaster recovery is a critical element of a company’s business continuity plan, enabling the quick resumption of IT operations and minimizing data loss if a disaster were to strike. Organizations often carry a high operational and cost burden to ensure continued operation of their applications, and databases in case of disaster. This includes operating a second physical site with duplicate hardware and software, management of multiple hardware or application specific replication tools, and ensuring readiness via periodic drills.

AWS’s CloudEndure Disaster Recovery makes it easy to shift your disaster recovery (DR) strategy to the AWS Cloud from existing physical or virtual data centers, private clouds, or other public clouds. CLOUDENDURE DISASTER RECOVERY is available on the AWS Marketplace as a software as a service (SaaS) contract and as a SaaS subscription. In this SaaS delivery model, AWS hosts and operates the CloudEndure Disaster Recovery application. As an additional component, CloudEndure Disaster Recovery uses an operating system level agent on each server that must be replicated and protected to AWS. The agent performs the block level replication to AWS.

The Well-Architected Framework has been developed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. Based on five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization. The Framework provides a consistent approach for customers and partners to evaluate architectures and implement designs that will scale over time.

This blog provides guidance on applying the AWS Well-Architected Framework’s five pillars and provides best practices on aligning your CloudEndure Disaster Recovery deployment to the pillars to ensure success.

Operational Excellence pillar :

Ensuring operational excellence when implementing CloudEndure Disaster Recovery begins with readiness.

Once you are familiar with CloudEndure Disaster Recovery , you should:

  1. Map your source environment into logical groups for DR. These groups may be based on location, application criticality, service level agreements (SLAs), recovery point objectives (RPO), recovery time objectives (RTO), etc. CloudEndure uses these groupings to prioritize and sequence recovery.
  2. Ensure that the requisite staging and target VPCs to be used by CLOUDENDURE DISASTER RECOVERY  are aligned to the Well-Architected Framework.
  3. Automate deployment, monitoring, and management of CLOUDENDURE DISASTER RECOVERY using the available CLOUDENDURE DISASTER RECOVERY API. Use the monitoring and logging data to monitor the operation status of CLOUDENDURE DISASTER RECOVERY and inject the data into existing centralized monitoring tools.

Security pillar :

  1. The Well-Architected Framework’s security principles should be applied at multiple layers when deploying CLOUDENDURE DISASTER RECOVERY . Design principles, including strong identity protection, traceability, all-layer network security, and data protection, should be extended to CLOUDENDURE DISASTER RECOVERY.
  2. To automate orchestration and recovery, CLOUDENDURE DISASTER RECOVERY uses the AWS API via IAM user credentials with programmatic access.
  3. CLOUDENDURE DISASTER RECOVERY agent uses an HTTPS connection to the User Console, which is used for management and monitoring. The User Console stores metadata about the source server in an encrypted database. The source data is replicated directly from source infrastructure to the target AWS account. While the connection can be public or private, it is recommended to use a private connection rather than the public internet. Enabling a private connection, and disabling the allocation of public IPs for CLOUDENDURE DISASTER RECOVERY replication servers should be set under replication settings.
  4. To ensure data security, CLOUDENDURE DISASTER RECOVERY encrypts all data replication traffic from the source to the staging area in your AWS account using the AES-256 encryption standard. Data at rest should be encrypted using the AWS KMS service. CLOUDENDURE DISASTER RECOVERY should be set to use the appropriate KMS key for EBS encryption under replication settings.

Reliability pillar :

  1. The best mechanism to ensure reliability in case of a disaster is to regularly validate and test recovery procedures. CLOUDENDURE DISASTER RECOVERY enables unlimited test launches allowing both spot testing, and full user acceptance and application testing. It is critical to test launch instances after initial synchronization to confirm availability and operation.
  2. Automate recovery by monitoring key performance indicators (KPIs), which vary by organization and workload. Once the KPIs are identified, you can integrate CLOUDENDURE DISASTER RECOVERY launch triggers into existing monitoring tools. An example of monitoring and automating DR using CLOUDENDURE DISASTER RECOVERY is detailed in this AWS blog.
  3. The CLOUDENDURE DISASTER RECOVERY User Console is managed by AWS and uses Well-Architected principles ensuring reliability and scalability. DR and redundancy plans are in place to ensure availability of the CLOUDENDURE DISASTER RECOVERY User Console. CLOUDENDURE DISASTER RECOVERY is leveraged to replicate its own User Console using a separate and isolated stack. This ensures recoverability of the User Console if the public SaaS version becomes unavailable.

Performance Efficiency pillar :

  1. As a SaaS solution, CLOUDENDURE DISASTER RECOVERY uses the performance efficiency pillar to maintain performance as demand changes. The replication architecture uses t3.small replication servers that can support the replication of most source servers. At times, source servers handling intensive write operations may require larger or dedicated replication servers. Dedicated or larger replication servers may be selected with minimal interruption to replication and be used for periods of time, for example to reduce initial sync times.
  2. To meet RTOs, typically measured in minutes, CLOUDENDURE DISASTER RECOVERY defaults target EBS volumes to Provisioned IOPS SSD (IO1) in the blueprint. During the target launch processes, an intensive I/O re-scanning process of all hardware and drivers, due to changing the hardware configuration, may occur. IO1 volumes reduce this impact to RTO. If IO1 volumes are not required for normal workload performance, we recommend that the volume type be programmatically changed after instance initialization. Alternatively, Standard or SSD volume types may be selected in the blueprint before launch. To ensure they meet your RTO requirements, be sure to test launch the various volume types.
  3. While CLOUDENDURE DISASTER RECOVERY encrypts all traffic in transit, it is recommended to use a secure connection from the source infrastructure to AWS via a VPN or AWS Direct Connect. The connection must have enough bandwidth to support the rate of data change to support ongoing replication, including spikes and peaks. Network efficiency and saturation may be impacted during the initial synchronization of data. CLOUDENDURE DISASTER RECOVERY agents utilize available bandwidth when replicating data. Throttling can be used to reduce impact on shared connections, and can be accomplished using bandwidth shaping tools to limit traffic on TCP Port 1500, or using the throttling option within CLOUDENDURE DISASTER RECOVERY . Considerations for throttling should include programmatically scheduling limits to avoid peak times, and understanding the impact of throttling to RPOs. It is recommended that all throttling be disabled once the initial sync of data is complete.

Cost Optimization pillar :

  1. CLOUDENDURE DISASTER RECOVERY takes advantage of the most effective services and resources, to achieve RPO of seconds, and RTO measured in minutes, at a minimal cost. The type of resources used for replication can be configured to balance cost, and RTO, and RPO requirements. For replication, CLOUDENDURE DISASTER RECOVERY uses shared t3.small instances in the staging area. Using EC2 Reserved Instances for replication servers is a method to reduce costs. Each shared replication server can mount up to 15 EBS volumes. In the staging area, CLOUDENDURE DISASTER RECOVERY uses either magnetic (<500 GB) or GP2 (>500 GB) EBS volumes to keep storage costs low. CLOUDENDURE DISASTER RECOVERY provides the ability to decrease storage costs further by using ST1 (>500 GB) EBS volumes. The use of ST1 EBS volumes can be configured in the replication settings, however this may impact RPOs and RTOs.
  2. In the event of a disaster, CLOUDENDURE DISASTER RECOVERY triggers an orchestration engine that launches production instances in the target AWS Region within minutes. The production instances’ configuration is based on the defined blueprint. Using the appropriate configuration of resources is key to cost savings. Right sizing the instance type selected in the blueprint ensures the lowest cost resource that meets the needs of the workload. Selecting appropriate EBS volume type for root and data volumes has a significant impact on consumption costs. TSO Logic, an AWS company, is a service that provides right sizing recommendations, and can be imported into the CLOUDENDURE DISASTER RECOVERY blueprint.

Conclusion :

In this post I reviewed AWS best practices and considerations for operating CLOUDENDURE DISASTER RECOVERY in AWS. Reviewing and applying the AWS Well-Architected Framework is a key step when deploying CloudEndure Disaster Recovery .

It lays the foundation to ensure a consistent and successful disaster recovery strategy. If disaster were to strike, implementing the concepts presented in the Operational, Security, and Reliability pillars supports a successful recovery. CLOUDENDURE DISASTER RECOVERY allows recoverability for your most critical workloads, while decreasing the total cost of ownership of your DR strategy.

Leave A Comment

You must be logged in to post a comment.