Data Processing#
Introduction#
Data Processing transfers historical SQL Server data to ClickHouse, populating CDM and ODS tables. This ensures a complete historical view in your ClickHouse data warehouse.
Note
This procedure may also be executed using the system's UI. For more information, see System Monitoring.
When is Data Processing Required?#
- Required - When migrating to MES version 11.1.3 or later.
- Not Required - For new installations of MES version 11.1.3 or later, since, from this version onward, MES captures data as it is generated.
Warning
When migrating to MES 11.1.3 or later, you must run Data Processing on all existing SQL Server ODS datasets. If this step is skipped, CDM events will not be generated, because the system will remain blocked while waiting for unprocessed Service History IDs (SHIDs).
Prerequisites#
Before starting, ensure you have:
- MES SQL Server ODS Database Access - Permissions to connect.
- Stored Procedure Execution Ability - Permissions to execute stored procedures in the ODS database.
- Parameter Understanding - Familiarity with the parameters of the Process Service History Ids (P_ProcessSHIDs) procedure (explained below).
Step-by-Step Instructions#
Follow these steps to initiate data processing from ODS to Data Platform:
Step 1: Connect to the MES SQL Server ODS Database#
Use your preferred SQL Server tool (SSMS, Azure Data Studio) to connect.
Step 2: Open a New Query Window#
Open a query window for the script.
Step 3: Execute the P_ProcessSHIDs Stored Procedure#
Copy and paste the code below, then review and adjust parameters as needed.
DECLARE @RC INT
DECLARE @StartDate DATETIME = DATEADD(YEAR, -1, GETUTCDATE()) -- Process data from 1 year ago
DECLARE @EndDate DATETIME = GETUTCDATE() -- Process data up to now
DECLARE @ExecuteReplication BIT = 1 -- Process replication data to ClickHouse ODS
DECLARE @ExecuteCDM BIT = 1 -- Process CDM data to ClickHouse CDM
DECLARE @Debug BIT = 0 -- Debug mode OFF
DECLARE @BlockSize INT = 10000 -- Process 10,000 transactions per batch
DECLARE @IgnoreQueuedCommands BIT = 0 -- 0 = stop with an error if pending commands exist in the ODS Command Queue (recommended)
DECLARE @SelectedTablesForReplication NVARCHAR(MAX) = 'CoreDataModel.T_CmfTimerHistory' -- Example: replicate only T_CmfTimerHistory; set to NULL to include all tables
DECLARE @SelectedEventsForCDM NVARCHAR(MAX) = 'CDM.Material.LossBonus,CDM.Calendar.CalendarDay' -- Example: process only these CDM events; set to NULL to include all events
EXECUTE @RC = [dbo].[P_ProcessSHIDs]
@StartDate
,@EndDate
,@ExecuteReplication
,@ExecuteCDM
,@Debug
,@BlockSize
,@IgnoreQueuedCommands
,@SelectedTablesForReplication
,@SelectedEventsForCDM
GO
Step 4: Parameter Explanation and Customization#
Review and customize these parameters:
@StartDate and @EndDate#
- Purpose - Defines the time range for data processing.
- Default - No default for @StartDate; a value must always be specified. @EndDate defaults to the current UTC time.
- Customization (see the sketch below):
  - Process all history - Set @StartDate to a very early date (for example, DATEADD(YEAR, -100, GETUTCDATE())). Consider the processing time for large datasets.
  - Process a shorter period - Adjust DATEADD (for example, DATEADD(MONTH, -6, GETUTCDATE())) or specify explicit dates (for example, '2023-01-01' and '2023-12-31').
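For illustration, here is a minimal sketch of the two date-range choices described above; the dates themselves are examples only.

-- Option A: process all available history (very early start date)
DECLARE @StartDate DATETIME = DATEADD(YEAR, -100, GETUTCDATE())
DECLARE @EndDate DATETIME = GETUTCDATE()

-- Option B: process a fixed calendar period with explicit dates
-- DECLARE @StartDate DATETIME = '2023-01-01'
-- DECLARE @EndDate DATETIME = '2023-12-31'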
@ExecuteReplication and @ExecuteCDM#
- Purpose - Flags to define the execution mode for data processing.
- Default - No defaults; values must always be specified.
- Customization:
  - Population of ClickHouse ODS Tables - To enable this feature, set the @ExecuteReplication flag to 1; otherwise, set it to 0.
  - Population of ClickHouse CDM Tables - To enable this feature, set the @ExecuteCDM flag to 1; otherwise, set it to 0.
- Recommendation - For the initial data processing after an MES version migration, enable both flags to process all data. For increased performance when reprocessing specific historical data, consider including only replication or CDM data, in conjunction with a list of @SelectedTablesForReplication and @SelectedEventsForCDM, respectively (see the sketch below).
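To make the recommendation concrete, here is a hedged sketch of a reprocessing call that skips replication and regenerates only CDM data; the date range and event name are illustrative values, not prescriptions.

-- Sketch: regenerate CDM data only, without re-running ODS replication
DECLARE @RC INT
DECLARE @StartDate DATETIME = DATEADD(MONTH, -6, GETUTCDATE())
DECLARE @EndDate DATETIME = GETUTCDATE()
DECLARE @ExecuteReplication BIT = 0 -- Skip population of ClickHouse ODS tables
DECLARE @ExecuteCDM BIT = 1 -- Populate ClickHouse CDM tables
DECLARE @Debug BIT = 0
DECLARE @BlockSize INT = 10000
DECLARE @IgnoreQueuedCommands BIT = 0
DECLARE @SelectedTablesForReplication NVARCHAR(MAX) = NULL -- All tables
DECLARE @SelectedEventsForCDM NVARCHAR(MAX) = 'CDM.Material.LossBonus' -- Only this CDM event (illustrative)
EXECUTE @RC = [dbo].[P_ProcessSHIDs]
@StartDate
,@EndDate
,@ExecuteReplication
,@ExecuteCDM
,@Debug
,@BlockSize
,@IgnoreQueuedCommands
,@SelectedTablesForReplication
,@SelectedEventsForCDM
GO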
@Debug#
- Purpose - Enables debug output.
- Default - 0.
- Values:
  - 0 - Debug OFF. Less output; recommended for production.
  - 1 - Debug ON. More detailed logging for troubleshooting. Use cautiously in production due to potential performance impact.
@BlockSize#
- Purpose - Sets transactions per processing batch.
- Default - 10000.
- Impact - Influences performance:
  - Larger @BlockSize - Potentially faster for large datasets, but might increase memory usage.
  - Smaller @BlockSize - Lower memory usage, but potentially slower execution.
- Recommendation - Start with 10000. Adjust for very large databases or performance issues. Monitor SQL Server performance when changing this value.
@IgnoreQueuedCommands#
- Purpose - Handles pending Command Queue commands during data processing. Command Queue manages ongoing online data replication.
- Default - 0 (Error Enabled - Recommended). Stops with an error if pending commands exist in the Command Queue within the defined time range.
- Values:
  - 0 - Error Enabled (Default, Recommended). Aborts with an error if pending Command Queue items are found in the defined time range, to prevent inconsistencies.
  - 1 - Error Disabled (Use with Caution!). Proceeds even with pending commands, potentially losing data from the Command Queue within the defined time range in ClickHouse ODS.
- Impact and Considerations:
  - Data Loss Risk (@IgnoreQueuedCommands = 1) - May omit data if pending Command Queue items exist for the processing period.
  - Default "Error Enabled" Rationale - Safety measure; assumes pending commands should be processed, and forces investigation before potential data loss.
  - When should you use 1 (extreme caution)? Rare troubleshooting, and only if pending commands are confirmed irrelevant and are causing failures. Generally not recommended; investigate the underlying issues instead.
@SelectedTablesForReplication#
- Purpose - If set, limits the data processing to the selected tables.
- Default - null (all tables included).
- Customization:
  - Include all tables (Default, Recommended) - Omit the parameter or set it to NULL.
  - Select specific tables (Use with Caution!) - Provide comma-separated, fully qualified table names (for example, 'Schema.Table1,Schema.Table2'); see the sketch below.
- Impact and Considerations:
  - Important - Use fully qualified table names (Schema.Table).
  - Service/Operation History tables are always included, regardless of the input value.
  - Impact on CDM data processing - The resulting table list also affects the queries used to populate the ClickHouse CDM tables, as this parameter is considered in both execution modes (@ExecuteReplication = 1 and/or @ExecuteCDM = 1). If unsure of the impact, include all tables.
  - When should you select specific tables? When replication data is missing in ClickHouse ODS for only some tables, using this parameter results in faster data processing. In all other cases, include all tables.
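As referenced above, a short sketch of both options; the second table name is a placeholder taken from this guide's own example, not a real table.

-- Limit replication to specific, fully qualified tables (second name is a placeholder)
DECLARE @SelectedTablesForReplication NVARCHAR(MAX) = 'CoreDataModel.T_CmfTimerHistory,Schema.Table2'
-- Or include all tables (recommended default)
-- DECLARE @SelectedTablesForReplication NVARCHAR(MAX) = NULL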
@SelectedEventsForCDM#
- Purpose - If set, limits the CDM data processing to the selected events.
- Default - null (all CDM events included).
- Customization:
  - Include all CDM events (Default, Recommended) - Omit the parameter or set it to NULL.
  - Select specific CDM events (Use with Caution!) - Provide comma-separated, unique CDM event names (for example, 'Event1,Folder1.Folder2.Event1').
- Impact and Considerations:
  - Important - Listing the available unique CDM event names requires MES SQL Server Online Database Access, in order to query the IoTEventDefinitionName values of the Control.T_CDMExecutionMap table (see the query below).
  - Impact on replication data processing - This value does not impact the replication of data to the ClickHouse ODS tables.
  - When should you select specific CDM events? When data is missing in ClickHouse CDM for only some events, using this parameter results in faster data processing. In all other cases, include all events.
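Based on the note above, the lookup can be a simple query against the MES SQL Server Online database, using the table and column cited above:

-- List the unique CDM event names accepted by @SelectedEventsForCDM
-- (run against the MES SQL Server Online database)
SELECT DISTINCT IoTEventDefinitionName
FROM Control.T_CDMExecutionMap
ORDER BY IoTEventDefinitionName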
Step 5: Execute the Procedure#
Select the Execute button in your SQL Server management tool to run the script.
Step 6: Monitor Execution (Optional)#
Monitor progress via @Debug output or SQL Server tools. Execution time varies with data volume and time range.
Post-Processing Expectations and Monitoring#
- Background Processing - Data Platform processes data after procedure completion:
  - Replication events for historical transactions.
  - CDM events generation.
  - Materialization to ClickHouse ODS/CDM datasets.
- Time to Completion - Can be lengthy, depending on:
  - Time range (@StartDate, @EndDate).
  - Historical data volume.
  - Selected execution modes and filtered tables/events.
  - Infrastructure performance.
- Initial Data Visibility - Data appears gradually in ClickHouse ODS/CDM; it is not immediate. It may be necessary to wait 30 minutes after executing P_ProcessSHIDs for Data Platform to start processing the data.
- Monitoring Progress (Recommended):
  - System Monitoring - Go to Administration and then System Monitoring in the environment UI and check the Lag counters in System/Database/Kafka. They should show high values immediately after the P_ProcessSHIDs execution and should eventually reach zero once all data is populated in ClickHouse ODS/CDM. Note that these counters may continue to show lag for some time, depending on system activity, as Online data for replication/CDM is captured automatically.
  - ClickHouse Data - Check ODS/CDM tables for increasing data volume.
  - Data Platform Logs - Review logs/dashboards for data processing progress/issues.
  - SQL Server Activity Monitor (If Needed) - Check SQL Server performance during P_ProcessSHIDs execution.
Important Considerations#
- Performance Impact - Can generate high load on SQL Server, especially during peak hours.
  - Schedule Off-Peak - Run during low activity (nights, weekends).
  - Monitor SQL Server - Track performance (CPU, memory, disk I/O). Pause or adjust @BlockSize if needed.
- Execution Time - Expect potentially long runs for large datasets. Plan ahead and communicate potential delays.
- Data Consistency - Validate data integrity in ClickHouse post-sync.
- Error Handling - Review SQL Server/Data Platform logs for issues. Check parameters, enable @Debug = 1, and consult documentation/support if needed.
- Planning & Communication - Inform teams about Data Processing execution, duration, and data availability in ClickHouse.
Troubleshooting (General Tips)#
- Slow Execution:
  - Check SQL Server performance by examining Dynamic Management Views (DMVs) to identify queries with high CPU or I/O consumption (see the sketch after this list).
  - Check the server level by using PerfMon or Task Manager to look for sustained high CPU usage (above 80%), low available memory, and high disk queue length.
  - Analyze common wait types like PAGEIOLATCH_* (I/O bottleneck) and SOS_SCHEDULER_YIELD (CPU bottleneck) to pinpoint the root cause.
  - Check the SQL Server Resource Monitor page for other indicators to analyze.
  - Increase @BlockSize incrementally.
  - Minimize concurrent SQL Server processes.
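As a starting point for the DMV checks mentioned above, here is a generic sketch using standard SQL Server DMVs; it is not specific to P_ProcessSHIDs.

-- Top accumulated wait types since the last SQL Server restart
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC

-- Currently executing requests, most CPU-intensive first
SELECT session_id, status, command, cpu_time, reads, writes, wait_type
FROM sys.dm_exec_requests
ORDER BY cpu_time DESC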
- Execution Errors:
  - Check SQL Server error logs for P_ProcessSHIDs errors.
  - Review the @Debug = 1 output.
  - Double-check parameters (table names, dates).
  - Consult documentation/support.
- Data Missing in ClickHouse:
  - Check System Monitoring to ensure data processing is complete.
  - Consider increasing the number of HouseKeeper replicas for faster processing if the Lag counters are very high and/or decreasing slowly.
  - Check Data Platform logs for replication/materialization issues.
  - Verify ClickHouse is running and accessible.
  - Check the @StartDate and @EndDate time range.
  - Validate that data exists in SQL Server for the time range (see the sketch below).
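To validate the last point, a hypothetical sketch using the example table from this guide; the CreatedOn timestamp column is illustrative, so substitute the actual column from your schema.

-- Hypothetical check: confirm SQL Server holds data for the processed window
-- (CreatedOn is an illustrative column name; adjust to your schema)
DECLARE @StartDate DATETIME = DATEADD(YEAR, -1, GETUTCDATE())
DECLARE @EndDate DATETIME = GETUTCDATE()
SELECT COUNT(*) AS RowsInRange
FROM CoreDataModel.T_CmfTimerHistory
WHERE CreatedOn BETWEEN @StartDate AND @EndDate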
Summary#
Data Processing is vital for populating the Data Platform with historical data. Follow this guide, configure the parameters, and monitor progress to transfer data to ClickHouse successfully. Plan for execution time and performance impacts, and validate the data post-sync.