Metric Types

To see which types of metrics are available, start with the kinds of information you can track.

Host-based metrics

These are the metrics used to evaluate the health or performance of an individual machine, setting aside for the moment its application stacks and services.

  • CPU

  • Memory

  • Disk Space

  • Processes

  • Network traffic

  • Storage I/O

  • System Metrics
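
As a rough illustration of collecting a few of these readings, the sketch below uses only the Python standard library. The function name host_snapshot and the choice of fields are assumptions made for this example; a real agent would gather far more (memory, processes, network counters), usually through a dedicated library or the operating system's own interfaces.

```python
import os
import shutil


def host_snapshot(path="/"):
    """Return a few host-level readings (hypothetical helper for illustration)."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    snapshot = {
        "cpu_count": os.cpu_count(),
        "disk_used_pct": used_pct,
        "disk_free_bytes": disk.free,
    }
    # Load average exists only on Unix-like systems and may be unobtainable.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot


if __name__ == "__main__":
    print(host_snapshot())
```

Sampling a function like this on a fixed interval and storing the results is what turns one-off readings into a metric you can graph and alert on.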

Application metrics

These are metrics concerned with units of processing or work that depend on the host-level resources, like services or applications. The specific types of metrics to look at depend on what the service is providing, what dependencies it has, and what other components it interacts with.

  • Error and success rates

  • Service failures and restarts

  • Performance and latency of responses

  • Resource usage
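
To make error rates and response latency concrete, here is a minimal sketch of an in-process recorder. The class RequestStats and its method names are hypothetical, and the percentile uses the simple nearest-rank method; production systems typically rely on a metrics library with histograms instead of keeping every sample in memory.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RequestStats:
    """Hypothetical recorder for request outcomes and latencies."""
    successes: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, ok, latency_ms):
        # Count the outcome and keep the latency sample.
        if ok:
            self.successes += 1
        else:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self):
        total = self.successes + self.errors
        return self.errors / total if total else 0.0

    def latency_percentile(self, pct):
        # Nearest-rank percentile over all recorded samples.
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]
```

For example, after recording three successful requests and one failure, error_rate() returns 0.25.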

Network and connectivity metrics

These are important gauges of outward-facing availability. For any system that spans more than one machine, they are also essential for ensuring that services remain accessible to the other machines that depend on them.

  • Connectivity

  • Error rates and packet loss

  • Latency

  • Bandwidth utilization
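
Connectivity and latency can be approximated with a simple TCP connection probe, sketched below. The function names tcp_check and probe are assumptions for this example. Note that a failed connection attempt is only a stand-in for connectivity problems: it does not measure packet loss directly, and the timing includes connection setup rather than pure network round-trip time.

```python
import socket
import time


def tcp_check(host, port, timeout=2.0):
    """Try one TCP connection; return (reachable, latency_ms or None)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        return False, None
    return True, (time.monotonic() - start) * 1000.0


def probe(host, port, attempts=5, timeout=2.0):
    """Repeat the check and summarize failure rate and average latency."""
    results = [tcp_check(host, port, timeout) for _ in range(attempts)]
    latencies = [ms for ok, ms in results if ok]
    return {
        "failure_rate": 1 - len(latencies) / attempts,
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else None,
    }
```

Running probe from several machines toward the same endpoint helps separate a failing service from a failing network path.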

Server pool metrics

While metrics about individual servers are useful, at scale a service is better represented as the ability of a collection of machines to perform work and respond adequately to requests. This type of metric is in many ways just a higher-level extrapolation of application and server metrics, but the resources in this case are homogeneous servers instead of machine-level components.

  • Pooled resource usage

  • Scaling adjustment indicators

  • Degraded instances
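
The sketch below shows one way per-server readings might be rolled up into pool-level metrics. The input shape (dictionaries with name, cpu_pct, and healthy keys), the function name summarize_pool, and the 75% and 25% thresholds are all assumptions chosen for illustration, not recommended values.

```python
def summarize_pool(instances, scale_up_at=75.0, scale_down_at=25.0):
    """Aggregate per-instance readings into pool-level metrics (illustrative)."""
    healthy = [i for i in instances if i["healthy"]]
    degraded = [i["name"] for i in instances if not i["healthy"]]

    # Pooled usage: average CPU across the instances still serving traffic.
    avg_cpu = sum(i["cpu_pct"] for i in healthy) / len(healthy) if healthy else None

    # Scaling hint based on the hypothetical thresholds above.
    if avg_cpu is None or avg_cpu >= scale_up_at:
        hint = "scale_up"
    elif avg_cpu <= scale_down_at:
        hint = "scale_down"
    else:
        hint = "hold"

    return {"avg_cpu_pct": avg_cpu, "degraded": degraded, "scaling_hint": hint}
```

Excluding degraded instances from the average keeps a failed server, which may report near-zero load, from masking pressure on the rest of the pool.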

External dependency metrics

Other metrics you may wish to add to your system are those related to external dependencies. Services often provide status pages or an API for discovering outages, but tracking these within your own systems, along with your actual interactions with the service, can help you identify provider problems that may affect your operations.

  • Service status and availability

  • Success and error rates

  • Run rate and operational costs

  • Resource exhaustion
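
One way to track your actual interactions with a provider is to wrap each outbound call and record whether it succeeded. The following is a minimal sketch under that assumption; DependencyTracker and its window parameter are hypothetical names, and it treats any raised exception as an error, which may be too coarse for a real client.

```python
from collections import deque


class DependencyTracker:
    """Record recent call outcomes for one external dependency (illustrative)."""

    def __init__(self, window=100):
        # Keep only the most recent outcomes; True means success.
        self.outcomes = deque(maxlen=window)

    def call(self, fn, *args, **kwargs):
        # Run the call, record the outcome, and re-raise any failure.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.outcomes.append(False)
            raise
        self.outcomes.append(True)
        return result

    def success_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)
```

Comparing this locally observed success rate with the provider's published status can reveal problems that affect you before, or without, an official outage notice.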
