Monitoring#

It's important to monitor system health so that problems can be detected early and, when a problem does occur, its root cause can be identified quickly.

Info

Critical Manufacturing offers an Observability service that provides access to a comprehensive set of dashboards covering the logs, metrics and traces of all Critical Manufacturing Platform components.

For more information please contact your Account Manager or visit https://portal.criticalmanufacturing.com/Help/

To monitor the system properly, it's important to have a clear understanding of its Architecture and Design Principles.

The following sections will focus on the primary components of the system as well as some of its requirements:

Application-Level Cluster
Database Servers
Kafka Cluster
ClickHouse Cluster

In addition, this page includes a section that describes a Recommended Monitoring Strategy.

Application-Level Cluster#

Depending on your Kubernetes (K8s) distribution, or the infrastructure it runs on, you may already have a set of monitoring tools that gives you access to relevant performance indicators.

We recommend that you check with your K8s provider which monitoring tools are already available to you.

Each customer's environment has a different monitoring signature, depending on the business processes the MES is supporting, how they are modeled, and the different modules in use. Because of this, it's important to take an initial monitoring baseline after the first weeks of production.
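As an illustration of what taking a baseline can mean in practice, the sketch below summarizes a series of metric samples into a few reference values. The function name, the choice of statistics (mean, 95th percentile, maximum) and the sample values are assumptions made for this example; they are not part of the product.

```python
import statistics


def build_baseline(samples):
    """Summarize metric samples collected during the baseline period."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    # The "inclusive" method keeps the result within the observed range.
    p95 = statistics.quantiles(samples, n=20, method="inclusive")[-1]
    return {
        "mean": statistics.fmean(samples),
        "p95": p95,
        "max": max(samples),
    }


# Hypothetical CPU utilization samples (percent) from the first weeks of production.
cpu_samples = [35.0, 42.0, 38.0, 51.0, 47.0, 40.0, 44.0, 39.0]
cpu_baseline = build_baseline(cpu_samples)
print(cpu_baseline)
```

Storing one such summary per metric gives later measurements a reference to be compared against.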

Database Server#

This section covers the following topics:

Critical Manufacturing databases and database instances
Data Collection
SQL Server Management Studio
Monitoring Events
Database Server Performance Counters

Critical Manufacturing Databases and Database Instances#

A typical Critical Manufacturing MES installation consists of two Microsoft SQL Server™ Database instances (Online and ODS).

The Online SQL Server Instance stores the MES database used for the runtime environment. The ODS SQL Server Instance stores the ODS and DWH databases used for long-term storage, reporting and analytics.

Data Collection#

Microsoft SQL Server™ provides a data collector that can be used to obtain and save data gathered from several sources. The data collector uses data collection containers, which let you determine the scope and frequency of data collection on a computer that is running SQL Server.

For more information, refer to the Data Collection ⧉ page.

SQL Server Management Studio#

SQL Server Management Studio is an administration tool that enables the management of the Database Engine, Analysis Manager, and SQL Query Analyzer in one place, together with the ability to write Transact-SQL, MDX, XMLA, and XML statements.

Special Administration Tools#

Perhaps one of the most interesting utilities provided by SQL Server Management Studio at the instance level is the Activity Monitor. The Activity Monitor is accessible by connecting to the database instance with Administration privileges, right-clicking on the instance and selecting Activity Monitor. It provides a comprehensive dashboard about the status and health of the database instance as shown in the next image.

SQL Server Management Studio also provides a variety of reports at both the instance level and database level. To open one, right-click on the instance or database and select the desired report, as shown in the next picture.

Info

These types of reports exist at both the instance level and database level. The number and type of reports vary depending on whether the selected object is an instance or a database.

The image below is a screenshot of an instance level Server Dashboard report.

For more information, refer to the Microsoft SQL Server Management Studio ⧉ page.

Monitoring Events#

SQL Server™ provides SQL Trace and event notifications to monitor events that occur in the Database Engine.

For more information, refer to the Monitoring Events ⧉ page.

Database Server Performance Counters#

As with the Application Server Performance Counters, it is possible to instrument and monitor the hardware usage and performance of the Database Servers. Because databases are I/O intensive, it is also worth checking the network and disk activity. Some important I/O performance counters are:

Category	Counter	Description
Network Interface	`Bytes Total/sec`	Monitors the network activity
PhysicalDisk	`% Disk Time`	Monitors the disk activity.
PhysicalDisk	`Current Disk Queue Length`	Monitors the disk queue length – a big number indicates contention on the disks

Table: Database Server I/O Counters

In addition, SQL Server installs an array of database specific counters that can be used to monitor different performance indicators associated with SQL Server. Some of the typical Performance Counters that are useful to monitor a SQL Server database are:

Category	Counter	Description
SQL Server™	Databases	Provides information about a SQL Server database, such as the amount of free log space available or the number of active transactions in the database. There can be multiple instances of this object
SQL Server™	General Statistics	Provides information about general server-wide activity, such as the number of users who are connected to an instance of SQL Server
SQL Server™	Locks	Provides information about the individual lock requests made by SQL Server, such as lock time-outs and deadlocks. There can be multiple instances of this object
SQL Server™	Memory Manager	Provides information about SQL Server memory usage, such as the total number of lock structures currently allocated

Table: Database Server Performance Counters

A comprehensive list of SQL Server performance counters can be found in the Using SQL Server Objects ⧉ page.

Recommended Monitoring Strategy#

This section describes the Critical Manufacturing MES's recommended monitoring strategy for a production environment based on best practices. The best practices can be summarized in the following two guidelines:

Actively deploy a commercial monitoring system which supports the Simple Network Management Protocol (SNMP) to automate the monitoring procedures and to generate automatic system alerts.
Proactively and periodically monitor trends in system resource usage and system performance.

Deploying a Commercial Monitoring System#

A commercial monitoring system facilitates the configuration and automatic monitoring of the health of the multiple hardware and software components that are part of a distributed application.

The next sub-sections describe the type of monitoring recommended for Application and Database Servers. There's also a small sub-section for the ERP integration monitoring in case it is being used.

Application Servers#

The set of components listed in the next table should be subject to automatic monitoring at each application server.

Components	Description	Healthy State
Critical Manufacturing («CM MES instance») Host Service State	This is the main Critical Manufacturing MES application server process	Running
Critical Manufacturing Remote Import/Export Gateway («CM MES instance»)	This process is used for exporting and importing data with other CM MES instances. It can be critical if export/import between CM MES instances is used, otherwise it’s not critical	Running
Internet Information Server (IIS) Service State	This is the web server process	Running
The IIS Web Site associated with the Critical Manufacturing MES application (e.g.: Default Web Site) Status	This is the web site hosted by the web server	Started
Microsoft Distributed Transaction Coordinator Service State	This is a critical component required for transaction handling	Running
Microsoft Message Queuing Service State	This is a critical component required for messaging	Running

Table: Application Server components

In addition, it’s recommended to perform the checks listed in the table below to ensure the availability and responsiveness of the system even when all components are up and running.

Test	Test Procedure Description
Test the Critical Manufacturing MES service responsiveness at a particular server.	Make a GET (test) API call to the Critical Manufacturing MES application server. The server must respond correctly within a reasonable amount of time.

Table: System availability and responsiveness
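A responsiveness test of this kind could be scripted roughly as follows. This is a minimal sketch: the endpoint URL, the 5-second limit and the function name are assumptions chosen for illustration, and the real test call depends on the API exposed by your installation. The `opener` parameter exists only so the check can be exercised without a live server.

```python
import time
import urllib.request


def check_responsiveness(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Issue a GET request and report whether it succeeded within the time limit."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return {"healthy": False, "status": None, "error": str(exc)}
    elapsed = time.monotonic() - started
    healthy = 200 <= status < 300 and elapsed <= timeout_s
    return {"healthy": healthy, "status": status, "elapsed_s": elapsed}
```

A scheduled task could call, for example, `check_responsiveness("https://mes.example.local/api/test")` (a hypothetical address) and raise an alert whenever `healthy` is false.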

The set of performance counters listed in the next table should be included in the automatic monitoring system for the application servers.

Area	Parameter	WMI Counter	Healthy State
CPU	CPU Utilization	`Processor\% Processor Time (_Total)`	< 80% (average)
CPU	Number of threads in the processor queue	`System\Processor Queue Length`
Disk	Disk Transfers	`Logical Disk\Disk Transfers/sec (_Total)`	Under 80% of its capacity (average)
Disk	Free Disk Space	`Logical Disk\% Free Space (_Total)`	> 20%
Memory	Available Physical Memory	`Memory\Available MBytes`
Memory	Memory hard page faults	`Memory\Pages/sec`
Memory	Page File Utilization	`Paging File\% Usage (_Total)`	< 80% (average)
Memory	Virtual Memory Utilization	`Memory\% Committed Bytes in Use`	< 80% (average)
MSMQ	Microsoft Message Queue Journal and Dead-Letter Queue	`MSMQ\Messages In Queue (Computer Queues)`
Network	Network Interface\Bytes Total/Sec	`Network Interface\Bytes Total/Sec`	Under 80% of its capacity (average)
Network	Network Interface\Output Queue Length	`Network Interface\Output Queue Length`
Transaction Management	Aborted Transactions	`Distributed Transaction Coordinator\Aborted Transactions/sec`
Transaction Management	Committed Transactions	`Distributed Transaction Coordinator\Committed Transactions/sec`
Transaction Management	Response Time -- Average	`Distributed Transaction Coordinator\Response Time -- Average`
Transaction Management	Transactions/sec	`Distributed Transaction Coordinator\Transactions/sec`

Table: Automatic Application Server performance counters to monitor
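The "Healthy State" limits in the table refer to averages, so an automated check should average several samples before comparing. The sketch below assumes the samples have already been collected by your monitoring tool and are passed in as lists of numbers; the function name and data layout are hypothetical. Only the counters with a numeric limit in the table are covered.

```python
from statistics import fmean

# (kind, limit): "max" means the average must stay below the limit,
# "min" means the average must stay above it. Limits come from the table above.
LIMITS = {
    r"Processor\% Processor Time (_Total)": ("max", 80.0),
    r"Paging File\% Usage (_Total)": ("max", 80.0),
    r"Memory\% Committed Bytes in Use": ("max", 80.0),
    r"Logical Disk\% Free Space (_Total)": ("min", 20.0),
}


def find_violations(samples_by_counter):
    """Return (counter, average) pairs whose average breaks the healthy limit."""
    violations = []
    for counter, samples in samples_by_counter.items():
        if counter not in LIMITS or not samples:
            continue  # no numeric limit defined, or nothing collected yet
        kind, limit = LIMITS[counter]
        average = fmean(samples)
        healthy = average < limit if kind == "max" else average > limit
        if not healthy:
            violations.append((counter, round(average, 1)))
    return violations
```

Averaging over a window avoids alerting on short spikes, which are normal for CPU and paging counters.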

Database Servers#

The set of components listed in the next table should be subject to automatic monitoring at each database server.

Components	Description	Healthy State
SQL Server «CM MES Online Database» Service State	This is the online Critical Manufacturing database server instance	Running
SQL Server «CM MES ODS Database» Service State	This is the ODS Critical Manufacturing database server instance. The ODS can run in the same instance as the online database server instance	Running
SQL Server Agent «CM MES Online Database»	This is a SQL Server service that executes scheduled administrative tasks for the online database server instance	Running
SQL Server Agent «CM MES ODS Database»	This is a SQL Server service that executes scheduled administrative tasks for the ODS database server instance. The ODS can run in the same instance as the online database server instance	Running
SQL Server «CM MES Data Warehouse Database» Analysis Services Service State	This is the Critical Manufacturing Data Warehouse analysis services server	Running
SQL Server Reporting Services «CM MES ODS Database»	This is the SQL Server service which is used for reporting purposes	Running
SQL Server Browser	A secondary service that supports browsing and discovery of SQL Server instances and services	Running
SQL Server Integration Services	A service used for SQL Server data integration and workflow applications	Running
SQL Server VSS Writer	A service that allows backups to be performed online	Running
Sentinel LDK License Manager	This is a licensing component which is used to license the Critical Manufacturing MES Scheduling module. It is typically installed in the database servers for high-availability purposes	Running

Table: Database Server components

In addition, it’s recommended to check the availability and responsiveness of the database even when all components are up and running. Furthermore, it’s also necessary to check the execution state of a set of critical database jobs. These recommended tests are described in the table below.

Test	Test Procedure Description
Test the database responsiveness at a particular database server instance	Execute a SQL Select statement. The database server must respond correctly within a reasonable amount of time
Test the state of the following job in the Online Database: {DatabaseName} Data Replication	Check that the job status is Executing
Test the state of the following jobs in the Online Database: {DatabaseName} Delete History Tables; {DatabaseName} Execute Enqueued Commands; {DatabaseName} Execute ODS Enqueued Commands; {DatabaseName} Purge Entity Types; {DatabaseName} Reseed Identity columns; {DatabaseName} Terminate Expired Chart Data Points	Check that the last Run Outcome is Succeeded for all jobs
Test the state of the following jobs in the ODS Database: {DatabaseName} Export Terminated Materials XML; {DatabaseName} MainTablePartitioning; {DatabaseName} ManagePartitions; {DatabaseName} Enqueued Commands in ODS	Check that the last Run Outcome is Succeeded for all jobs
Test the state of the following jobs in the DWH Database: {DatabaseName} Critical Manufacturing Data Warehouse; {DatabaseName} Critical Manufacturing DWH Partitioning Management	Check that the last Run Outcome is Succeeded for all jobs

Table: Database Server jobs to monitor
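The "last Run Outcome" checks can be automated by reading the SQL Server Agent history. The sketch below is one possible approach, not the product's own tooling: the query uses the standard `msdb.dbo.sysjobs` and `msdb.dbo.sysjobhistory` tables, while the function name, the way rows are passed in and the example job names are assumptions. Running the query requires a database driver of your choice, which is deliberately left out.

```python
# run_status values as documented for msdb.dbo.sysjobhistory.
RUN_STATUS = {0: "Failed", 1: "Succeeded", 2: "Retry", 3: "Canceled", 4: "In Progress"}

# Latest job-outcome row (step_id = 0) for each SQL Server Agent job.
LAST_OUTCOME_SQL = """
SELECT j.name, h.run_status
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h ON h.job_id = j.job_id
WHERE h.step_id = 0
  AND h.instance_id = (SELECT MAX(h2.instance_id)
                       FROM msdb.dbo.sysjobhistory AS h2
                       WHERE h2.job_id = j.job_id AND h2.step_id = 0)
"""


def jobs_needing_attention(rows, expected_jobs):
    """Return expected jobs whose last outcome is missing or not Succeeded.

    rows: iterable of (job_name, run_status) tuples, e.g. the result of
    executing LAST_OUTCOME_SQL.
    """
    last_status = {name: status for name, status in rows}
    problems = {}
    for job in expected_jobs:
        status = last_status.get(job)
        if status != 1:
            problems[job] = RUN_STATUS.get(status, "No history")
    return problems
```

This covers only the jobs whose last outcome must be Succeeded. A continuously running job such as Data Replication has to be checked for its current execution state instead.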

The set of performance counters listed in the next table must be subject to automatic monitoring at the database servers.

Area	Parameter	WMI Counter	Healthy State
CPU	CPU Utilization	`Processor\% Processor Time (_Total)`	< 80% (average)
CPU	Number of threads in the processor queue	`System\Processor Queue Length`
Disk	Disk Transfers	`Logical Disk\Disk Transfers/sec (_Total)`	Under 80% of its capacity (average)
Disk	Free Disk Space	`Logical Disk\% Free Space (_Total)`	> 20%
Memory	Available Physical Memory	`Memory\Available MBytes`
Memory	Memory hard page faults	`Memory\Pages/sec`
Memory	Page File Utilization	`Paging File\% Usage (_Total)`	< 80% (average)
Memory	Virtual Memory Utilization	`Memory\% Committed Bytes in Use`	< 80% (average)
MSMQ	Microsoft Message Queue Journal and Dead-Letter Queue	`MSMQ\Messages In Queue (Computer Queues)`
Network	Network Interface\Bytes Total/Sec	`Network Interface\Bytes Total/Sec`	Under 80% of its capacity (average)
Network	Network Interface\Output Queue Length	`Network Interface\Output Queue Length`
Transaction Management	Aborted Transactions	`Distributed Transaction Coordinator\Aborted Transactions/sec`
Transaction Management	Committed Transactions	`Distributed Transaction Coordinator\Committed Transactions/sec`
Transaction Management	Response Time -- Average	`Distributed Transaction Coordinator\Response Time -- Average`
Transaction Management	Transactions/sec	`Distributed Transaction Coordinator\Transactions/sec`

Table: Database Server performance counters to monitor

In addition to the performance counters above, it’s recommended to add the following performance counters for each database server instance. Each server instance provides a collection of WMI counters that start with MSSQL${Database Instance Name}.

Parameter	Description	WMI Counter
Batch Requests/Sec	Batch Requests/Sec measures the number of batches SQL Server is receiving per second. This counter is a good indicator of how much activity is being processed by your SQL Server box. The higher the number, the more queries are being executed on your box. Like many counters, there is no single number that can be used universally to indicate your machine is too busy. Today’s machines are getting more and more powerful all the time and therefore can process more batch requests per second. You should review this counter over time to determine a baseline number for your environment	`MSSQL${Database Instance Name}:SQL Statistics\Batch Requests/sec`
Buffer cache hit ratio	The buffer cache hit ratio counter represents how often SQL Server is able to find data pages in its buffer cache when a query needs a data page. The higher this number the better, because it means SQL Server was able to get data for queries out of memory instead of reading from disk. You want this number to be as close to 100 as possible. Having this counter at 100 means that 100% of the time SQL Server has found the needed data pages in memory. A low buffer cache hit ratio could indicate a memory problem	`MSSQL${Database Instance Name}:Buffer Manager\Buffer cache hit ratio`
Buffer page life expectancy	The page life expectancy counter measures how long pages stay in the buffer cache in seconds. The longer a page stays in memory, the more likely SQL Server will not need to read from disk to resolve a query. You should watch this counter over time to determine a baseline for what is normal in your database environment. Some say anything below 300 (or 5 minutes) means you might need additional memory	`MSSQL${Database Instance Name}:Buffer Manager\Page life expectancy`
Checkpoint Pages/Sec	The checkpoint pages per second counter measures the number of pages written to disk by a checkpoint operation. You should watch this counter over time to establish a baseline for your systems. Once a baseline value has been established you can watch this value to see if it is climbing. If this counter is climbing, it might mean you are running into memory pressures that are causing dirty pages to be flushed to disk more frequently than normal	`MSSQL${Database Instance Name}:Buffer Manager\Checkpoint pages/sec`
Lock Waits/Sec	In order for SQL Server to manage concurrent users on the system, SQL Server needs to lock resources from time to time. The lock waits per second counter tracks the number of times per second that SQL Server is not able to obtain a lock right away for a resource. Ideally you don't want any request to wait for a lock. Therefore you want to keep this counter at zero, or close to zero at all times	`MSSQL${Database Instance Name}:Locks\Lock Waits/sec (_Total)`
Page Splits/Sec	This counter measures the number of times SQL Server had to split a page when updating or inserting data per second. Page splits are expensive, and cause your table to perform more poorly due to fragmentation. Therefore, the fewer page splits you have the better your system will perform. Ideally this counter should be less than 20% of the batch requests per second	`MSSQL${Database Instance Name}:Access Methods\Page Splits/sec`
Processes Blocked	The processes blocked counter identifies the number of blocked processes. When one process is blocking another process, the blocked process cannot move forward with its execution plan until the resource that is causing it to wait is freed up. Ideally you don't want to see any blocked processes. When processes are being blocked you should investigate	`MSSQL${Database Instance Name}:General Statistics\Processes blocked`
SQL Compilations/Sec	The SQL Compilations/Sec counter measures the number of times SQL Server compiles an execution plan per second. Compiling an execution plan is a resource-intensive operation. Compilations/Sec should be compared with the number of Batch Requests/Sec to get an indication of whether or not compilations might be hurting your performance. To do that, divide the number of batch requests per second by the number of compilations per second to get the number of batches executed per compile. Ideally you want to have one compile per every 10 batch requests	`MSSQL${Database Instance Name}:SQL Statistics\SQL Compilations/Sec`
SQL Re-Compilations/Sec	When the execution plan is invalidated due to some significant event, SQL Server will recompile it. The Re-Compilations/Sec counter measures the number of times a re-compile event was triggered per second. Re-compiles, like compiles, are expensive operations so you want to minimize the number of recompilations. Ideally you want to keep this counter less than 10% of the number of Compilations/Sec	`MSSQL${Database Instance Name}:SQL Statistics\SQL Re-Compilations/Sec`
User Connections	The user connections counter identifies the number of different users that are connected to SQL Server at the time the sample was taken. You need to watch this counter over time to understand your baseline user connection numbers. Once you have some idea of your high and low water marks during normal usage of your system, you can then look for times when this counter falls outside those marks. If the value of this counter goes down and the load on the system is the same, then you might have a bottleneck that is not allowing your server to handle the normal load. Keep in mind though that this counter value might go down just because fewer people are using your SQL Server instance	`MSSQL${Database Instance Name}:General Statistics\User Connections`

Table: Database Server WMI counters to monitor
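Several of the guidelines in the table are ratios between counters rather than fixed limits. The sketch below shows how those rules of thumb could be evaluated once the counter values have been sampled. The dictionary keys and the function name are hypothetical, and the thresholds are the indicative values quoted in the table, which should be tuned against your own baseline.

```python
def evaluate_sql_counters(c):
    """Apply the rule-of-thumb checks from the table to one set of samples."""
    findings = []
    batch = c["batch_requests_per_sec"]
    compilations = c["compilations_per_sec"]
    recompilations = c["recompilations_per_sec"]
    page_splits = c["page_splits_per_sec"]

    # Page splits should stay below 20% of the batch requests per second.
    if page_splits > 0 and page_splits >= 0.20 * batch:
        findings.append("page splits at or above 20% of batch requests")
    # Ideally there is one compilation per 10 batch requests (or fewer).
    if compilations > 0 and batch / compilations < 10:
        findings.append("fewer than 10 batch requests per compilation")
    # Recompilations should stay below 10% of compilations.
    if recompilations > 0 and recompilations >= 0.10 * compilations:
        findings.append("recompilations at or above 10% of compilations")
    # Page life expectancy below 300 seconds may indicate memory pressure.
    if c["page_life_expectancy_s"] < 300:
        findings.append("page life expectancy below 300 seconds")
    if c["processes_blocked"] > 0:
        findings.append("blocked processes present")
    return findings


# Hypothetical sample values for one database server instance.
sample = {
    "batch_requests_per_sec": 1000.0,
    "compilations_per_sec": 200.0,
    "recompilations_per_sec": 10.0,
    "page_splits_per_sec": 50.0,
    "page_life_expectancy_s": 250.0,
    "processes_blocked": 0,
}
print(evaluate_sql_counters(sample))
```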

ERP Integration#

ERP integration consists of the exchange of messages between Critical Manufacturing MES and the ERP. These messages are persisted in the system in the ERP Entry object. An easy way to monitor the health of the ERP Integration is to query for active ERPEntry object instances in the Failed state. Other states, which can be checked for statistical purposes but do not affect the health of the system, are Received, Processed and Processing.

Monitoring Trends#

It’s important to review critical trends in resource usage and system performance proactively and periodically. Regarding hardware resources, it’s necessary to compare how the usage of the following has changed relative to a reference period (baseline):

CPU Utilization
Memory Utilization
Network Utilization
Disk Space Utilization (mainly relevant for the database servers)

From a performance point of view, it’s also important to analyze how the system latency compares to the latency from a previous reference period.
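A comparison against the reference period can be as simple as computing the percentage change of each indicator. In the sketch below, the metric names, the sample values and the 20% reporting threshold are all assumptions chosen for illustration.

```python
def percent_change(baseline_value, current_value):
    """Relative change of the current value against the baseline, in percent."""
    if baseline_value == 0:
        raise ValueError("baseline value must be non-zero")
    return (current_value - baseline_value) / baseline_value * 100.0


def flag_trends(baseline, current, threshold_pct=20.0):
    """Return the metrics that moved by at least threshold_pct in either direction."""
    flagged = {}
    for name, baseline_value in baseline.items():
        if name not in current:
            continue  # metric was not measured in the current period
        change = percent_change(baseline_value, current[name])
        if abs(change) >= threshold_pct:
            flagged[name] = round(change, 1)
    return flagged


# Hypothetical averages for the reference period and the current period.
baseline_period = {"cpu_percent": 40.0, "memory_percent": 60.0,
                   "disk_used_gb": 500.0, "service_latency_ms": 200.0}
current_period = {"cpu_percent": 50.0, "memory_percent": 63.0,
                  "disk_used_gb": 550.0, "service_latency_ms": 150.0}
print(flag_trends(baseline_period, current_period))
```

Reporting changes in both directions is deliberate: a sharp drop in usage or latency can be as informative as an increase.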

Critical Manufacturing MES comes with a Service Performance standard report that lists the top services by latency (sorted by average duration). If the performance of a particular service starts to degrade considerably, it deserves a dedicated investigation. If it turns out to be an issue related to the Critical Manufacturing MES standard product, please contact the Critical Manufacturing MES support team.

Kafka Cluster#

Dead-letter Queue#

Undeliverable and expired transactional messages are stored in transactional dead-letter queues.

For more information, refer to the Dead-letter queues ⧉ page.

Note

Messages in this queue indicate a problem with the system message bus. In this case, check Troubleshooting to solve the problem.

For more information, refer to Installation Manual > Microsoft Message Queuing, or to Configuring Remote Queue, also in the Installation Manual.