# Disaster Recovery
This guide targets Operations teams running the MES Data Platform. It clarifies what to back up, when you must restore, and how to rebuild the platform with minimal data loss and downtime.
The following scenarios illustrate when and how to restore analytics data from SQL Server and other sources:
- Initial Sync During a Migration - Use Data Processing to seed/re-seed analytics stores from SQL Server. For more information, see Data Processing.
- Disaster Recovery When ClickHouse Data Is Lost - Rebuild ClickHouse from authoritative sources and available backlogs. For more information, see ClickHouse Database Backup and Restore.
Backup Requirements#
As of 11.x, the MSSQL databases and SMB shares are the critical backups. ClickHouse is a critical platform component, but because its data can be rebuilt from MSSQL and the event backlogs, backing it up is recommended rather than required; a ClickHouse backup substantially reduces recovery time. Kafka and S3 backups are likewise not critical, but they improve recovery efficiency and help close data gaps. The table below summarizes each component's role.
| Component | Examples | Role in Recovery | Backup Priority |
|---|---|---|---|
| MSSQL Databases | Online, ODS, DWH | Authoritative system-of-record for business data and historical facts used to rebuild analytics. | Critical |
| SMB Shares | Deployment bundles, configuration, shared folders | Required to restore platform configuration, packages, and artifacts. | Critical |
| ClickHouse Databases | Online, ODS, CDM, DWH | High-performance storage for the Operational Data Store (ODS) and all analytics data. Can be rebuilt from MSSQL and event backlogs. | Recommended (reduces RTO) |
| Kafka Topics | For example, `{systemname}_workqueue_datamanager` | Event backlog for async processing; enables replay to reduce data gaps. | Recommended (reduces RPO) |
| Object Storage (S3) | Raw/landed data, files referenced by pipelines | Used to rehydrate/replay pipelines and fill analytical gaps. | Recommended (reduces RPO) |
Table: Backup components, examples, and their role in disaster recovery
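For the critical MSSQL backups, the `msdb` catalog records when each database was last backed up, which makes freshness easy to verify. A minimal sketch, assuming `pyodbc`; the server name, driver version, and authentication are illustrative:

```python
# Print the most recent backup finish time for each critical database,
# using the standard msdb.dbo.backupset catalog table.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=mssql.example.local;DATABASE=msdb;"
    "Trusted_Connection=yes;TrustServerCertificate=yes"
)

CRITICAL_DBS = ["Online", "ODS", "DWH"]  # system-of-record databases

with pyodbc.connect(CONN_STR) as conn:
    cur = conn.cursor()
    cur.execute(
        """
        SELECT database_name, MAX(backup_finish_date) AS last_backup
        FROM msdb.dbo.backupset
        WHERE database_name IN (?, ?, ?)
        GROUP BY database_name
        """,
        CRITICAL_DBS,
    )
    for db_name, last_backup in cur.fetchall():
        print(f"{db_name}: last backup finished at {last_backup}")
```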
## Consistency Expectations
- There is no cross-system point-in-time (PIT) guarantee covering SQL Server, ClickHouse, Kafka, and S3 simultaneously - and it is not required for a successful recovery.
- Aim for best-effort alignment, then follow the restore order below to converge the platform to a consistent state.
## Recommended Restore Order
The recommended restore order follows platform dependencies: start with the system of record and configuration, then replay backlogs, and finish by rebuilding the analytics layer. Follow these steps (a scripted sketch follows the list):

1. Restore MSSQL databases (Online, ODS, DWH).
2. Restore SMB shares.
3. Restore Kafka topics.
4. Restore S3 objects.
5. Restore ClickHouse databases (Online, ODS, CDM, DWH).
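This order can be encoded in a small runbook driver so that a failed critical step halts the run before dependent layers are touched. The sketch below is illustrative only; each step function is a hypothetical placeholder for your actual restore tooling:

```python
# Illustrative runbook driver enforcing the restore order. A failed critical
# step aborts the run; a failed optional step is skipped (at the cost of RPO).
from typing import Callable

def restore_mssql() -> None: ...      # Online, ODS, DWH (critical)
def restore_smb() -> None: ...        # deployment/configuration artifacts (critical)
def restore_kafka() -> None: ...      # optional: event backlog for replay
def restore_s3() -> None: ...         # optional: raw/landed files
def rebuild_clickhouse() -> None: ... # via Data Processing and/or replay

STEPS: list[tuple[str, Callable[[], None], bool]] = [
    ("MSSQL databases", restore_mssql, True),
    ("SMB shares", restore_smb, True),
    ("Kafka topics", restore_kafka, False),
    ("S3 objects", restore_s3, False),
    ("ClickHouse databases", rebuild_clickhouse, True),
]

for name, step, critical in STEPS:
    try:
        step()
        print(f"OK: {name}")
    except Exception as exc:
        if critical:
            raise SystemExit(f"ABORT: {name} failed: {exc}")
        print(f"SKIP: {name} failed ({exc}); continuing, RPO may increase")
```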
## ClickHouse Data Recovery and Restoration
### If ClickHouse Backups Were Not Available Before the Incident
If your SQL Server backups are available but ClickHouse backups were not created - and it is no longer possible to generate matching backups from the source system - follow the procedure described in the Re-Create ClickHouse Backups guide. This process allows you to rebuild compatible ClickHouse backups and restore analytics data consistently before proceeding with disaster recovery.
### Before you start
- Communicate the incident and freeze inbound changes (pause or throttle heavy writers where feasible).
- Confirm you have the latest MSSQL and SMB backups.
- If available, identify the Kafka offsets and S3 object versions you intend to use for replay (see the sketch after this list).
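Committed consumer-group offsets and the current watermarks can be recorded up front so that a replay window can be chosen later. A minimal sketch, assuming the confluent-kafka client; the broker address, group id, and topic name are illustrative:

```python
# Record committed offsets and high watermarks per partition for one topic.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka.example.local:9092",
    "group.id": "datamanager",       # group whose position you want to pin
    "enable.auto.commit": False,     # read-only inspection, commit nothing
})

topic = "mes_workqueue_datamanager"  # i.e. {systemname}_workqueue_datamanager
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    print(f"partition={tp.partition} committed={tp.offset} low={low} high={high}")

consumer.close()
```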
### Recovery workflow
1. Restore MSSQL (Online, ODS, DWH) from the chosen backups.
2. Restore SMB shares to recover deployment/configuration artifacts.
3. Optionally restore Kafka (or pin consumer groups to the desired offsets) to make event replay available.
4. Optionally restore S3 objects if your pipelines require historical files.
5. Recreate ClickHouse schemas and data (a manual backfill sketch follows this list):
   - Use Data Processing to repopulate ClickHouse from SQL Server where applicable.
   - Replay from Kafka/S3 if your architecture uses these as backlogs for IoT/async datasets.
6. Invalidate OData models so clients see the rebuilt data:
   - Ensure Message Bus notifications are flowing so Data Manager invalidates cached OData models, or restart Data Manager if needed.
7. Resume producers/consumers and normal workloads in a controlled manner.
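Data Processing is the supported mechanism for repopulating ClickHouse from SQL Server. For orientation only, the sketch below shows the general shape of such a backfill using `pyodbc` and `clickhouse-connect`; the connection details, table, and column names are hypothetical:

```python
# Illustrative bulk copy of one fact table from SQL Server into ClickHouse,
# batched to keep memory bounded. Not a replacement for Data Processing jobs.
import pyodbc
import clickhouse_connect

BATCH = 50_000

src = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=mssql.example.local;"
    "DATABASE=DWH;Trusted_Connection=yes;TrustServerCertificate=yes"
)
dst = clickhouse_connect.get_client(host="clickhouse.example.local")

cur = src.cursor()
cur.execute("SELECT order_id, station, quantity, recorded_at FROM dbo.FactProduction")
cols = [d[0] for d in cur.description]  # column names from the source query

while True:
    rows = cur.fetchmany(BATCH)
    if not rows:
        break
    # clickhouse-connect expects a sequence of row sequences plus column names
    dst.insert("dwh.fact_production", [tuple(r) for r in rows], column_names=cols)
```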
### Post-recovery validation
- Spot-check key facts: record counts by day, min/max timestamps, and a small set of business-critical queries in MSSQL vs ClickHouse (a count-comparison sketch follows this list).
- Verify OData responses (schema and sample data) and UI dashboards.
- Confirm background jobs, consumers, and IoT event ingestion have stabilized (no growing backlogs, healthy throughput).
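A daily count comparison is a quick way to run the spot-check described above. A minimal sketch with `pyodbc` and `clickhouse-connect`; the connection details and table names are hypothetical:

```python
# Compare daily row counts for one business-critical table in MSSQL vs
# ClickHouse; extend with min/max timestamps and key business queries.
import pyodbc
import clickhouse_connect

mssql = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=mssql.example.local;"
    "DATABASE=DWH;Trusted_Connection=yes;TrustServerCertificate=yes"
)
ch = clickhouse_connect.get_client(host="clickhouse.example.local")

cur = mssql.cursor()
cur.execute(
    "SELECT CAST(recorded_at AS date), COUNT(*) "
    "FROM dbo.FactProduction GROUP BY CAST(recorded_at AS date)"
)
mssql_counts = {row[0]: row[1] for row in cur.fetchall()}

ch_counts = dict(
    ch.query(
        "SELECT toDate(recorded_at) AS d, count() FROM dwh.fact_production GROUP BY d"
    ).result_rows
)

for day in sorted(set(mssql_counts) | set(ch_counts)):
    a, b = mssql_counts.get(day, 0), ch_counts.get(day, 0)
    flag = "" if a == b else "  <-- MISMATCH"
    print(f"{day}: mssql={a} clickhouse={b}{flag}")
```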
## Notes on RPO/RTO
- The Recovery Point Objective (RPO) primarily depends on your MSSQL backup cadence plus availability of Kafka/S3 backlogs for replay.
- The Recovery Time Objective (RTO) improves significantly if you also back up ClickHouse, because you can restore instead of fully rebuilding via Data Processing (a native backup/restore sketch follows this list).
- Align backup frequency and retention with business impact for the most critical datasets and topics.
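ClickHouse's native `BACKUP`/`RESTORE` SQL commands are what make restoring (rather than rebuilding) possible. A minimal sketch, assuming a backup destination named `backups` is configured on the server; the database and file names are examples:

```python
# Take and later restore a native backup of one analytics database.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.example.local")

# Routine operation: back up the database to the configured destination.
client.command("BACKUP DATABASE dwh TO Disk('backups', 'dwh_2024-06-01.zip')")

# During recovery: restore it from the same destination.
client.command("RESTORE DATABASE dwh FROM Disk('backups', 'dwh_2024-06-01.zip')")
```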