To see what types of metrics are available, we can start by looking at the kinds of information we can track.
At the host level, these include anything involved in evaluating the health or performance of an individual machine, disregarding for the moment its application stacks and services.
CPU
Memory
Disk space
Processes
Network traffic
Storage I/O
System Metrics
Application metrics are concerned with units of processing or work that depend on host-level resources, like services or applications. The specific metrics to look at depend on what the service provides, what dependencies it has, and what other components it interacts with.
Error and success rates
Service failures and restarts
Performance and latency of responses
Resource usage
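As a rough illustration of the first and third items, the sketch below derives success and error rates and a 95th-percentile latency from a batch of request records. The RequestRecord type, the summarize function, and the nearest-rank percentile method are assumptions made for this example, not part of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical record of one handled request."""
    ok: bool            # True if the request succeeded
    latency_ms: float   # time taken to respond, in milliseconds


def summarize(records):
    """Return success/error rates and nearest-rank p95 latency for a batch."""
    total = len(records)
    if total == 0:
        # No traffic: rates are reported as 0.0 and latency is unknown.
        return {"total": 0, "success_rate": 0.0, "error_rate": 0.0,
                "p95_latency_ms": None}

    errors = sum(1 for r in records if not r.ok)
    latencies = sorted(r.latency_ms for r in records)

    # Nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed.
    rank = max(1, math.ceil(0.95 * total))
    p95 = latencies[rank - 1]

    return {
        "total": total,
        "success_rate": (total - errors) / total,
        "error_rate": errors / total,
        "p95_latency_ms": p95,
    }
```

In practice these values would usually be computed over a sliding time window rather than a fixed batch, so that the rates reflect recent behavior.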
Network and connectivity metrics are important gauges of outward-facing availability, and they are also essential for ensuring that services remain accessible to other machines in any system that spans more than one machine.
Connectivity
Error rates and packet loss
Latency
Bandwidth utilization
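Two of these indicators reduce to simple arithmetic over counters. The sketch below assumes we can sample an interface's cumulative byte counter at two points in time and that we know how many packets were sent and received. The function names and parameters are hypothetical, and counter wraparound is not handled.

```python
def bandwidth_utilization(bytes_start, bytes_end, interval_s, link_capacity_bps):
    """Fraction of link capacity used between two byte-counter samples.

    Assumes the counter did not wrap or reset during the interval.
    """
    if interval_s <= 0 or link_capacity_bps <= 0:
        raise ValueError("interval and link capacity must be positive")
    bits_transferred = (bytes_end - bytes_start) * 8
    return bits_transferred / interval_s / link_capacity_bps


def packet_loss_rate(packets_sent, packets_received):
    """Fraction of sent packets that never arrived."""
    if packets_sent == 0:
        return 0.0
    return (packets_sent - packets_received) / packets_sent
```

For example, 125,000,000 bytes transferred in 10 seconds on a 1 Gbit/s link corresponds to a utilization of 0.1, or 10 percent.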
While metrics about individual servers are useful, at scale a service is better represented as the ability of a collection of machines to perform work and respond adequately to requests. This type of metric is in many ways just a higher-level extrapolation of application and server metrics, but the resources in this case are homogeneous servers instead of machine-level components.
Pooled resource usage
Scaling adjustment indicators
Degraded instances
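To make the idea of treating servers as pooled resources concrete, here is a minimal sketch that aggregates CPU usage across the healthy members of a pool, lists degraded instances, and emits a simple scaling hint. The Instance type, the pool_summary function, and the 75 percent and 25 percent thresholds are all illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of one server in a pool."""
    name: str
    cpu_used: float    # cores currently in use
    cpu_total: float   # cores available on the machine
    healthy: bool      # False if the instance is degraded


def pool_summary(instances, scale_up_at=0.75, scale_down_at=0.25):
    """Summarize pooled CPU usage, degraded instances, and a scaling hint."""
    healthy = [i for i in instances if i.healthy]
    degraded = [i.name for i in instances if not i.healthy]

    capacity = sum(i.cpu_total for i in healthy)
    used = sum(i.cpu_used for i in healthy)
    # With no healthy capacity, treat the pool as fully saturated.
    usage = used / capacity if capacity else 1.0

    if usage >= scale_up_at:
        hint = "scale_up"
    elif usage <= scale_down_at:
        hint = "scale_down"
    else:
        hint = "hold"

    return {"pooled_usage": usage, "degraded": degraded, "scaling_hint": hint}
```

Only healthy instances count toward capacity here, so losing a machine raises the pooled usage figure and can trigger a scale-up hint even when traffic is unchanged.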
Other metrics you may wish to add to your system are those related to external dependencies. Often, services provide status pages or an API to discover service outages, but tracking these within your systems—as well as your actual interactions with the service—can help you identify problems with your providers that may affect your operations.
Service status and availability
Success and error rates
Run rate and operational costs
Resource exhaustion
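The last two items can be estimated from figures that most providers expose in some form. The sketch below assumes "run rate" means a linear projection of spend over a billing period and that quota consumption is roughly constant per day. Both function names and the 30-day default are hypothetical.

```python
def projected_monthly_cost(spend_so_far, days_elapsed, days_in_month=30):
    """Linearly project spend for the full period from spend to date."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spend_so_far / days_elapsed * days_in_month


def days_until_exhausted(quota_limit, quota_used, daily_usage):
    """Estimate days left before a quota runs out at the current daily rate.

    Returns None when there is no ongoing consumption to extrapolate from.
    """
    if daily_usage <= 0:
        return None
    remaining = max(quota_limit - quota_used, 0)
    return remaining / daily_usage
```

A linear projection is crude, but it is often enough to flag a provider bill or quota that is trending toward a limit well before it is reached.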