Disaster Recovery (DR) Powered by Continuous Data Protection (CDP)

In Disaster Recovery (DR), two critical factors are minimizing downtime and preventing data loss. These are measured by RTO (Recovery Time Objective), which defines how quickly operations must resume, and RPO (Recovery Point Objective), which defines the maximum amount of data, measured in time, that can be lost.
Backups are necessary in any IT environment, but they must be complemented by a Disaster Recovery solution to meet the stricter RPO and RTO requirements of critical applications and data without depending on the backup window. Even newer mechanisms in backup solutions, such as instant recovery, cannot meet the strict RPO and RTO demands of Disaster Recovery.
As a reference, the following table compares the RPO, RTO, and SLA of various data recovery mechanisms, including recovery from image-level backups and failover replicas.

Continuous Data Protection (CDP)
Continuous Data Protection is the technology of choice for achieving RPOs of seconds (near-zero RPO) and low RTOs.
CDP works by continuously capturing every write made in the client's production site (tenant) in real time and automatically replicating each data version to a recovery site or cloud, while maintaining the write-order fidelity of all block-level changes.
It stores these changes in a journal. If a restore to a specific point in time is required, the journal is used to roll the replica back to that point.
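As a rough illustration of the journal idea, the following Python sketch records block-level writes in arrival order and rebuilds a disk image as of a chosen point in time. The names JournalEntry, Journal, record, and restore are hypothetical and do not come from any CDP product. For simplicity the sketch replays journal entries onto a base image instead of rolling a current replica back, and it assumes timestamps never decrease.

```python
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    seq: int          # position in the journal; preserves write order
    timestamp: float  # capture time in seconds (assumed non-decreasing)
    block: int        # block address on the virtual disk
    data: bytes       # block contents after the write


@dataclass
class Journal:
    entries: list = field(default_factory=list)

    def record(self, timestamp: float, block: int, data: bytes) -> None:
        # Appending in arrival order is what keeps write-order fidelity.
        entry = JournalEntry(len(self.entries), timestamp, block, data)
        self.entries.append(entry)

    def restore(self, base_image: dict, point_in_time: float) -> dict:
        """Rebuild the disk as it looked at point_in_time."""
        image = dict(base_image)
        for entry in self.entries:
            if entry.timestamp > point_in_time:
                break  # entries are time-ordered, so the rest are newer
            image[entry.block] = entry.data
        return image


journal = Journal()
journal.record(10.0, block=0, data=b"v1")
journal.record(12.5, block=1, data=b"a1")
journal.record(15.0, block=0, data=b"v2")  # e.g. an unwanted overwrite

# Recover the state just before the unwanted write at t=15.0.
restored = journal.restore(base_image={0: b"v0", 1: b"a0"}, point_in_time=13.0)
print(restored)  # {0: b'v1', 1: b'a1'}
```

Because every write is kept as its own entry, any timestamp between two writes is a valid recovery point, which is what separates a journal from a series of periodic snapshots.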
Continuous Data Protection also does not limit the physical distance between the tenant production and recovery environments; the main requirements are minimum guaranteed bandwidth and controlled latency.
It is important to note that CDP is not the same as hypervisor-based asynchronous replication, such as Veeam Replication and VMware vSphere Replication, or near-sync snapshot-based replication, which is available in some HCI systems such as Nutanix.
The main difference is that CDP uses APIs to capture and log every I/O in real time. When deployed in VMware environments, most CDP solutions integrate with the vSphere APIs for I/O Filtering (VAIO) framework.

On the other hand, in asynchronous and near-sync replication, the data capture frequency is not continuous; the data is captured and replicated in short intervals by crash-consistent or app-consistent snapshots.
Asynchronous Replication
When you take snapshots of multiple VMs simultaneously, each VM requires a certain amount of time to capture the state of its disk, memory, and configuration. If the snapshots are taken in parallel, this can delay completing the snapshot process for all VMs.
The time it takes to process each snapshot, read the data, replicate it, and commit snapshots, especially for VMs with extensive data change rates, can result in a longer replication cycle.
Due to this, vDisk-based snapshot frequency is traditionally restricted to intervals no shorter than 15 minutes. This interval ensures the replication process has enough time to capture and synchronize changes to the source data, reducing the risk of inconsistencies or corruption during the replication process.
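To make the effect of the replication cycle on RPO concrete, here is a small Python calculation based on a simplified model that is an assumption of this sketch, not a vendor formula. A replica becomes usable only when its cycle completes, so a failure just before completion loses everything written since the previous snapshot was taken. The function names and all input figures are hypothetical.

```python
def replication_cycle_seconds(changed_gb: float,
                              throughput_mbps: float,
                              snapshot_overhead_s: float) -> float:
    """Time to create the snapshot, ship the changed data, and commit."""
    # 1 GB is treated as 8000 megabits (decimal units).
    transfer_s = changed_gb * 8 * 1000 / throughput_mbps
    return snapshot_overhead_s + transfer_s


def worst_case_rpo_seconds(interval_s: float, cycle_s: float) -> float:
    """Worst-case data loss window under the simplified model."""
    # If a cycle takes longer than the interval, the next cycle can only
    # start when the previous one ends, so the effective interval grows.
    effective_interval = max(interval_s, cycle_s)
    return effective_interval + cycle_s


cycle = replication_cycle_seconds(changed_gb=20.0,
                                  throughput_mbps=1000.0,
                                  snapshot_overhead_s=60.0)
rpo = worst_case_rpo_seconds(interval_s=900.0, cycle_s=cycle)
print(cycle, rpo)  # 220.0 1120.0
```

With these illustrative numbers, a 15-minute interval yields a worst-case loss window of nearly 19 minutes, and the window grows with the change rate because the cycle itself gets longer.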
Below is Veeam’s asynchronous replication architecture for VMware vSphere.

It is based on VM snapshots replicated asynchronously. You can configure replication jobs using crash-consistent or application-consistent snapshots.
In step 6, we can verify that Veeam Backup & Replication requests the vCenter Server or ESXi host to create a VM snapshot.
The VM disks are set to a read-only state, and each virtual disk receives a delta file. Any changes to the VM during replication are written to these delta files.
The source proxy starts the replication, reading the VM data from the read-only VM disk.
The great advantage of CDP over all types of snapshot-based replication is that it significantly lowers the impact on system performance, particularly on primary I/O latency and on the replication process itself.
Another positive point of CDP is that, like asynchronous replication, it does not limit the physical distance between the sites; the main requirement is a guaranteed minimum bandwidth.
Near-Sync Replication from HCI systems
Snapshots are still the underlying mechanism in the near-synchronous replication solutions of some HCI systems.
The difference in Nutanix’s case is the use of crash-consistent Lightweight Snapshots (LWS), which allow an RPO between 1 and 15 minutes. However, there are limitations, such as lock-in to the HCI system and the inability to perform cross-hypervisor disaster recovery.
Unlike the traditional vDisk-based snapshots used by asynchronous replication, Nutanix LWS leverages markers and is completely OpLog-based. In Nutanix, vDisk snapshots are done in the Extent Store portion of the system (Persistent Data Storage).
In this case, the OpLog performs a function similar to a filesystem journal. It is built as a staging area to handle bursts of random writes, coalesce them, and then sequentially drain the data to the Extent Store. The OpLog is stored on the SSD tier on the CVM to provide extremely fast write I/O performance, especially for random I/O workloads.
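The staging behavior described above can be pictured with a short Python sketch. This is only a conceptual model and not Nutanix's implementation. The StagingLog class, its methods, and the dictionary standing in for the Extent Store are all hypothetical.

```python
class StagingLog:
    """Conceptual write-staging area: absorb random writes, then drain."""

    def __init__(self) -> None:
        self._pending = {}  # block address -> most recent data

    def write(self, block: int, data: bytes) -> None:
        # Coalescing: a later write to the same block replaces the
        # earlier one, so only the latest version is drained.
        self._pending[block] = data

    def drain(self, extent_store: dict) -> list:
        """Flush pending writes in ascending block order."""
        order = sorted(self._pending)
        for block in order:
            extent_store[block] = self._pending[block]
        self._pending.clear()
        return order


log = StagingLog()
for block, data in [(42, b"a"), (7, b"b"), (42, b"c"), (19, b"d")]:
    log.write(block, data)  # a burst of random writes

store = {}
drained = log.drain(store)
print(drained)    # [7, 19, 42]
print(store[42])  # b'c'
```

Four random writes turn into three sequential ones, and the persistent store only sees the final contents of block 42.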

When near-sync replication is enabled, vDisk-based seed snapshots are taken and replicated to the remote site before LWSs begin. LWSs are then replicated continuously to the remote site. The system creates an intermediate snapshot every hour and retains it for 6 hours as a checkpoint to help with RTO.

VMware VAIO Architecture
Below is a CDP architecture for VMware environments using vSphere APIs for I/O Filtering (VAIO).

The VAIO framework offers a secure method for integrating third-party software into a VMware environment. It allows that software to intercept I/O as it flows between virtual machines and their virtual disks and to perform services on the data, such as replication.
A key component in this framework is the I/O filter. I/O filters are software components that can be installed on ESXi hosts to provide additional data services to virtual machines. As presented previously, these filters process I/O requests as data moves between a virtual machine’s guest operating system and its virtual disks. I/O filtering can be enabled per virtual disk, and the filters are entirely independent of the underlying storage tier.
VMware offers specific categories of I/O filters, and third-party vendors can create them. Typically, they are distributed as packages that provide an installer to deploy the filter components, as with the Veeam solution. The I/O filter itself is installed during the Continuous Data Protection configuration process.
For example, refer to the following procedure to install Veeam I/O Filters:
https://helpcenter.veeam.com/docs/backup/vsphere/cdp_io_filter_install.html?ver=120
Once the I/O filters are deployed, the vCenter Server configures and registers an I/O filter storage provider, also known as a VASA provider, for each host in the cluster. These storage providers communicate with the vCenter Server and make the data services provided by the I/O filters visible in the VM Storage Policies interface. After associating virtual disks with the storage policy, the I/O filters are enabled on those virtual disks.
In the VAIO framework, User Space and Kernel Space are two separate areas of memory where I/O filters operate. Each serves distinct purposes and offers different levels of access and control.
- User Space is where user-level applications, including third-party I/O filters, operate. In the VAIO framework, I/O filters developed by third-party vendors typically run in user space. This area is isolated from the core operating system to prevent direct manipulation of the hardware or Kernel, thereby ensuring system stability and security.
- Kernel Space is where the VAIO framework runs and handles the low-level aspects of I/O filtering. It is responsible for processing I/O requests from virtual machines to virtual disks. The kernel space interacts with user space to execute the filtering logic implemented by third-party I/O filters, like Veeam’s I/O filter.
The I/O path with VAIO is as follows:
Step 1: A write comes from a guest OS and is handled in User Space by the vSCSI (Virtualized SCSI Interface) of the VM.
Step 2: The vSCSI driver opens a channel to the vSCSI backend in the Kernel Space, which processes the write by opening a location on the File System layer.
Step 3: The File System layer then hands the write to the File Device layer.
Step 4: The VAIO framework can see the I/O request before the File Device layer sends the write to the physical device where the virtual disk is hosted. The VAIO framework has visibility via a kernel module attached to the File Device layer, and it can see whether that VM’s I/O has a particular data service attached to it.
Step 5: If there is no policy, the I/O commits without filtering and without overhead. If there is a policy for that VM’s I/O, the request is passed back to the User Space of the requesting VM, where the data service executes the filter against the I/O.
Step 6: If the policy is for replication, the filter records the changes in a change log in memory or cache. The filter then passes the captured write data to a third-party CDP service for processing and replication.
Step 7: The filter returns the write I/O directly to the physical device without needing to pass through the entire vSCSI/File System layers again. The changes are applied directly to the virtual disk.
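The decision flow in steps 4 to 7 can be sketched in Python. This is a toy model and not the VAIO API: ReplicationFilter, handle_write, and the policies and disks dictionaries are hypothetical stand-ins, and the vSCSI and File System layers from steps 1 to 3 are not modeled.

```python
from collections import deque


class ReplicationFilter:
    """Stand-in for a third-party I/O filter running in user space."""

    def __init__(self) -> None:
        self.change_log = deque()  # in-memory change log (step 6)

    def on_write(self, disk_id: str, offset: int, data: bytes) -> None:
        self.change_log.append((disk_id, offset, data))

    def ship(self) -> list:
        """Hand the captured writes to a (hypothetical) CDP service."""
        batch = list(self.change_log)
        self.change_log.clear()
        return batch


def handle_write(disk_id, offset, data, policies, disks) -> None:
    io_filter = policies.get(disk_id)  # step 4: is a data service attached?
    if io_filter is not None:          # step 5: policy found, call the filter
        io_filter.on_write(disk_id, offset, data)  # step 6: log the change
    # Step 7 (or the unfiltered path of step 5): commit to the virtual disk.
    disks.setdefault(disk_id, {})[offset] = data


cdp_filter = ReplicationFilter()
policies = {"vm1-disk0": cdp_filter}  # only this disk has a replication policy
disks = {}

handle_write("vm1-disk0", 0, b"A", policies, disks)
handle_write("vm2-disk0", 0, b"B", policies, disks)  # no policy, no overhead
handle_write("vm1-disk0", 8, b"C", policies, disks)

shipped = cdp_filter.ship()
print(shipped)  # [('vm1-disk0', 0, b'A'), ('vm1-disk0', 8, b'C')]
```

Both disks receive their writes, but only the protected disk's writes reach the change log, in the order they were issued.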
Upcalling I/O from Kernel Space back to User Space may seem like a time-consuming process, but the mechanism is specifically designed to take only microseconds. This architecture keeps overhead minimal while preserving the security and stability of the kernel.
Conclusion
Organizations are under increasing pressure to reduce downtime and safeguard against data loss and application interruptions, ensuring that critical systems and data are consistently protected and can be quickly restored.
Disaster Recovery powered by Continuous Data Protection (CDP) enables a real-time data protection strategy that saves a copy of every change made to data, so that a system can be rolled back to any point in time. Continuous Data Protection is a replication technology that achieves RPOs of seconds (near-zero RPO) and RTOs of minutes.
Continuous Data Protection’s key strength lies in its real-time block-level data capture and replication, which uses APIs and integrates with technologies such as VMware’s VAIO framework through I/O filters. This enables efficient and seamless recovery for organizations, even across geographically distributed environments.
References
https://blogs.vmware.com/virtualblocks/2015/02/05/vsphere-apis-for-io-filtering
