Monitoring

Monitoring is the process of collecting, analyzing, and using data to track the performance and health of a system.

Monitoring is typically focused on collecting logs and metrics – things like CPU usage, memory consumption, error rates, and response times – to provide insights into a system’s state. The primary goal of monitoring is to detect and alert on predefined issues or failures within a system.
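
As a minimal sketch of what "alerting on predefined issues" can look like, the following Python snippet checks one sample of metrics against a fixed set of threshold rules. The `Rule` class, the `evaluate` function, the metric names, and the threshold values are all hypothetical and chosen for illustration; real monitoring tools define rules in their own formats.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Rule:
    """A predefined alert condition: fire when `metric` exceeds `threshold`."""
    name: str
    metric: str
    threshold: float


def evaluate(rules: Sequence[Rule], sample: Mapping[str, float]) -> list[str]:
    """Return the names of the rules that fire for one sample of metrics."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        # A metric missing from the sample is skipped rather than treated as a failure.
        if value is not None and value > rule.threshold:
            fired.append(rule.name)
    return fired


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule("HighCPU", "cpu_percent", 90.0),
    Rule("HighErrorRate", "error_rate", 0.05),
    Rule("SlowResponses", "p95_response_ms", 500.0),
]

sample = {"cpu_percent": 97.2, "error_rate": 0.01, "p95_response_ms": 640.0}
print(evaluate(RULES, sample))  # → ['HighCPU', 'SlowResponses']
```

The rules are fixed before any data arrives, so a check like this can only report conditions that someone anticipated when writing them.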

Monitoring is closely associated with observability. The same tools and data can be used for both, but the two concepts are distinct. Monitoring is a reactive process that responds to events and changes in state according to pre-configured rules. Observability is a proactive process, focused instead on analyzing a system's behavior and performance after the fact.

Monitoring can be implemented at various levels of a system, including:

Infrastructure monitoring: Focuses on tracking the health and performance of underlying infrastructure like servers, containers, and virtual machines, through metrics like uptime, CPU utilization, memory utilization, disk I/O, and more.
Network monitoring: Tracks network health, including bandwidth utilization, packet loss, latency, round-trip time, and potential security threats.
Application (performance) monitoring (APM): Monitors specific software applications, tracking metrics like response times, resource usage, error rate, and various business-specific metrics.
Database monitoring: Focuses on query performance, cache hit ratios, number of connections, storage utilization, and more.
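
To make the application-level metrics above more concrete, here is a small sketch that derives an error rate and response-time percentiles from a batch of request records. The `Request` record, the choice to count only HTTP 5xx statuses as errors, and the nearest-rank percentile method are assumptions made for this example, not a description of how any particular APM tool computes them.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    status: int


def error_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests with a 5xx status (assumption: only 5xx counts as an error)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up sample data, for illustration only.
requests = [
    Request(120, 200), Request(95, 200), Request(110, 200), Request(300, 500),
    Request(105, 200), Request(98, 200), Request(1500, 503), Request(101, 200),
    Request(99, 200), Request(130, 404),
]
durations = [r.duration_ms for r in requests]

print(error_rate(requests))       # → 0.2
print(percentile(durations, 50))  # → 105
print(percentile(durations, 95))  # → 1500
```

Percentiles are used here instead of an average because a single slow request (1500 ms in this sample) shows up clearly in the 95th percentile while barely moving the median.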

Monitoring tools

Datadog — Popular monitoring and analytics service (proprietary).
Grafana — An open-source analytics and monitoring platform.
Nagios — A monitoring and alerting system for networks, servers, and applications.
Prometheus — An open-source monitoring and alerting toolkit.
Zabbix — An open-source monitoring tool, also available as a subscription service.

See also Observability tools, many of which include monitoring capabilities.