Monitoring

Monitoring is the process of collecting, analyzing, and using data to track the performance and health of a system.

Monitoring is typically focused on collecting logs and metrics – things like CPU usage, memory consumption, error rates, and response times – to provide insights into a system’s state. The primary goal of monitoring is to detect and alert on predefined issues or failures within a system.

Monitoring is closely associated with observability. The same tools and data can be used for both, but the two concepts are distinct. Monitoring is a reactive process that responds to events and state changes according to pre-configured rules. Observability is a proactive process, focused instead on analyzing a system’s behavior and performance after the fact.
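
The rule-driven, reactive nature of monitoring can be illustrated with a minimal Python sketch. It assumes metrics arrive as a plain dictionary of numbers. The `Rule` class, the `evaluate` function, and the metric names and thresholds are all hypothetical and chosen for illustration only; they do not reflect the configuration format of any particular monitoring tool.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """A pre-configured alert rule (hypothetical structure, for illustration)."""
    name: str
    metric: str
    threshold: float
    # Comparison that decides a breach; defaults to "value > threshold".
    compare: Callable[[float, float], bool] = operator.gt


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Return the names of all rules that fire for one metrics sample."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        # A metric missing from the sample is skipped, not treated as a breach.
        if value is not None and rule.compare(value, rule.threshold):
            fired.append(rule.name)
    return fired


# Example rules and sample values are assumptions, not recommended thresholds.
rules = [
    Rule("high_cpu", "cpu_percent", 90.0),
    Rule("high_error_rate", "error_rate", 0.05),
    Rule("low_disk", "disk_free_gb", 10.0, operator.lt),
]
sample = {"cpu_percent": 97.2, "error_rate": 0.01, "disk_free_gb": 4.0}
alerts = evaluate(rules, sample)
```

In this sketch only conditions that were anticipated and written down as rules can raise an alert, which is the sense in which monitoring targets predefined issues.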

Monitoring can be implemented at various levels of a system, including:

  • Infrastructure monitoring: Focuses on tracking the health and performance of underlying infrastructure like servers, containers, and virtual machines, through metrics like uptime, CPU utilization, memory utilization, disk I/O, and more.

  • Network monitoring: Tracks network health, including bandwidth utilization, packet loss, latency, round-trip time, and potential security threats.

  • Application (performance) monitoring (APM): Monitors specific software applications, tracking metrics like response times, resource usage, error rates, and various business-specific metrics.

  • Database monitoring: Focuses on query performance, cache hit ratios, number of connections, storage utilization, and more.
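
As a rough illustration of the application-level metrics mentioned above, the following Python sketch reduces a batch of request records to an error rate and response-time figures. The record layout (response time in milliseconds paired with an HTTP status code), the `summarize` function, and the choice to count only 5xx statuses as errors are assumptions made for this example, not conventions taken from any specific tool.

```python
from typing import Dict, List, Tuple


def summarize(requests: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize (response_time_ms, status_code) pairs into basic metrics."""
    if not requests:
        raise ValueError("no requests to summarize")
    n = len(requests)
    durations = sorted(ms for ms, _ in requests)
    # Assumption: only server-side failures (HTTP 5xx) count as errors.
    errors = sum(1 for _, status in requests if status >= 500)
    # Nearest-rank 95th percentile, computed as ceil(0.95 * n) with
    # integer arithmetic to avoid floating-point rounding surprises.
    rank = (95 * n + 99) // 100
    return {
        "count": n,
        "error_rate": errors / n,
        "mean_ms": sum(durations) / n,
        "p95_ms": durations[rank - 1],
    }


# Made-up sample data for illustration.
stats = summarize([(100.0, 200), (200.0, 200), (300.0, 500), (400.0, 200)])
```

A percentile is reported alongside the mean because a small number of slow requests can be hidden by an average; which percentile to track is a design choice that depends on the application.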

Monitoring tools

  • Datadog — Popular monitoring and analytics service (proprietary).

  • Grafana — An open-source analytics and monitoring platform.

  • Nagios — A monitoring and alerting system for networks, servers, and applications.

  • Prometheus — An open-source monitoring and alerting toolkit.

  • Zabbix — An open-source monitoring tool, also available as a subscription service.

See also Observability tools, many of which include monitoring capabilities.