Alerting

Alerting is an extension of monitoring activities. It is the process of notifying responsible parties when a system is not behaving as expected, as determined by pre-defined thresholds or conditions around system metrics or logs.

For example, alerts may be configured for sudden surges in traffic, high error rates, or a high number of consecutive failed login attempts. Such alerts must be delivered in near real time.
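The idea of evaluating pre-defined thresholds against metrics can be illustrated with a minimal Python sketch. The rule names, metric names, and threshold values below are hypothetical and chosen only to mirror the examples above; this is not how any particular alerting product implements rules.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    """A hypothetical rule: fire when a metric crosses a threshold."""
    name: str
    metric: str
    threshold: float
    # Applied as compare(observed_value, threshold); defaults to "greater than".
    compare: Callable[[float, float], bool] = operator.gt


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return the names of all rules whose condition holds for the given metrics."""
    firing = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # A missing metric is skipped here; a real system might alert on absence.
        if value is not None and rule.compare(value, rule.threshold):
            firing.append(rule.name)
    return firing


# Illustrative rules and thresholds (assumptions, not recommended values).
RULES = [
    AlertRule("HighErrorRate", "error_rate", 0.05),
    AlertRule("FailedLoginBurst", "consecutive_failed_logins", 10),
    AlertRule("TrafficSurge", "requests_per_second", 5000),
]

if __name__ == "__main__":
    snapshot = {
        "error_rate": 0.12,
        "consecutive_failed_logins": 3,
        "requests_per_second": 800,
    }
    for name in evaluate(RULES, snapshot):
        print(f"ALERT: {name}")
```

In practice, an evaluation like this would run on a short, fixed interval so that notifications go out in near real time, and the firing rule names would be handed to a notification channel rather than printed.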

Alerting systems tend to be built into monitoring and observability systems. For example, Prometheus has Alertmanager, while Elasticsearch has Watcher. However, there are also a few standalone alerting systems designed to consume the data produced by monitoring and observability tools such as Prometheus and the Elastic Stack. Examples include:

  • PagerDuty: Alerting and incident management services.

  • Opsgenie: Alerting and on-call management, part of the Atlassian suite.

  • VictorOps: Formerly a standalone on-call management system, now part of Splunk (as Splunk On-Call).
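As a rough illustration of how a monitoring tool might hand an alert to a standalone service, the sketch below builds a "trigger" event in the shape used by PagerDuty's Events API v2 (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, and `severity`). The helper name `build_trigger_event`, the placeholder routing key, and the sample values are assumptions for illustration. The code only constructs the JSON body and does not send anything.

```python
import json
from typing import Any, Dict

# Severity levels accepted in the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> Dict[str, Any]:
    """Build the body of a 'trigger' event (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",    # opens (or updates) an alert
        "payload": {
            "summary": summary,       # human-readable description
            "source": source,         # the affected system
            "severity": severity,
        },
    }


if __name__ == "__main__":
    event = build_trigger_event(
        routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
        summary="Error rate above 5% on checkout service",
        source="checkout-service",
        severity="error",
    )
    # In a real integration this body would be POSTed as JSON to
    # https://events.pagerduty.com/v2/enqueue
    print(json.dumps(event, indent=2))
```

Other standalone services expose their own ingestion endpoints and field names, so the exact payload shape would differ from one product to another.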