Metrics, The First Pillar of Observability

Have you ever deployed a side project without a care in the world, only to see it crash the next day and leave you scratching your head for hours trying to figure out what went wrong? Have you ever been assigned to a project where things go wrong left and right, forcing you to rigorously inspect the code, run the system manually, call the API by hand or, god forbid, test on production?

Informally speaking, observability is what makes your life easier in scenarios like these. It aims to make your system more transparent, helping you find issues faster and diagnose bugs more easily.

Observability also helps you better understand the components of your system and the interactions between them. You can track and monitor user or system behavior and plan or respond accordingly.

In this article, the first in a series, we dive into Metrics, one of the pillars of observability, and into some details of one of the most powerful tools in this domain: Prometheus.

Pillars of Observability

Observability is the process of making a system more transparent, enabling engineers to better understand system performance and detect potential issues. The three pillars of observability are Traces, Logs, and Metrics, which together provide a complete picture of a system’s behavior.

Metrics

Photo taken from FastAPI Observability Dashboard on Grafana Labs

Metrics are numerical representations of data measured over intervals of time. They capture different criteria in a form that is easy to analyze and aggregate. Metrics are often used to monitor key aspects of a system, such as resource utilization, application performance, or business operations.

Under the hood, a metric is a series of temporal observations, i.e. a time series. For example, at second t the CPU consumption can be 37%, and at t+1 it can be 42%. This time-series nature of metrics is paramount to effectively capturing, utilizing and reasoning about them. Unfortunately, it can also make the required calculations non-intuitive or harder to grasp at first.

In the rest of this article, we go over different types of metrics and ways to understand and reason about each. In doing so, we will use Prometheus.

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit. It excels at gathering real-time metrics in a time-series format, integrates with a wide range of systems, and has been an industry standard for more than a decade.

Prometheus has a few key components:

  1. Scheduler for scraping: It scrapes metrics from targets by querying HTTP endpoints at regular intervals.
  2. Time Series Database (TSDB): Metrics are stored as time series data, allowing efficient querying and analysis.
  3. PromQL: It comes with its own query language called PromQL, which is used to extract and aggregate data from the stored time series.
  4. Web UI: A simple web interface allows you to visualize metrics and run queries directly in your browser.
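
To make the scraping model in point 1 concrete, here is a minimal sketch of a target that exposes metrics over HTTP for Prometheus to scrape. It assumes the official prometheus_client Python library and an arbitrarily chosen port; your setup may differ.

Python
import time

from prometheus_client import Counter, start_http_server

# A hypothetical counter, just to have something to expose.
# The client exposes it as requests_served_total.
requests_served = Counter("requests_served", "Number of requests served")

if __name__ == "__main__":
    # Expose an HTTP endpoint at http://localhost:8000/metrics
    # that Prometheus can scrape at regular intervals.
    start_http_server(8000)
    while True:
        requests_served.inc()  # pretend we served a request
        time.sleep(1)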

Now that the introductions are out of the way, let’s get our hands dirty with some metrics!

Prometheus Metrics Explained

Prometheus supports several types of metrics, with Counter, Gauge, and Histogram being among the most commonly used. Each of them fits specific use cases and offers different insights into your system.

We at Data Build Company recently bought a couch for the office, and since we’re all data freaks, I’m taking the opportunity of this blog post to see how we can use different types of metrics to gather data about our nice couch.

Prompt: generate a comfortable cute couch at the office with people sitting, chatting and laughing around it.

1. Counter

As a first experiment, we want to track the number of times people sit on the couch. For this we can use a Counter. A Counter is a metric that only ever increments, making it useful for tracking the occurrence of events over time. Once increased, its value cannot go down, except when it is reset (for example, when the process restarts).
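
As a rough sketch of how such a counter might be instrumented in an application (assuming the prometheus_client Python library; the function name is illustrative):

Python
from prometheus_client import Counter

# Counts how many times someone sat on the couch; it only ever goes up.
# Depending on client version, this may be exposed as sits_total.
sits = Counter("sits", "Number of times someone sat on the couch")

def on_person_sits_down():
    sits.inc()  # increment by 1 each time the event happens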

After placing the couch in the office at 12:00, these are the timestamps where someone sat on it:

  • 12:33 – person A sits on the couch
  • 12:45 – person B and C sit on the couch
  • 13:03 – person D sits on the couch
  • 13:18 – person E sits on the couch

Using a counter metric called “sits”, and incrementing it each time someone uses the couch, we would end up with these observations:

Time     “sits” counter
12:00    0
12:33    1
12:45    3
13:03    4
13:18    5

We are assuming that the observations are scraped every minute. For the sake of brevity, the rows where the counter’s value is unchanged are omitted.

There are a few interesting queries one can run on a counter metric like this. The first is increase. The following diagram displays the result of the PromQL query increase(sits[20m]):

In this query, sits[20m] specifies a Range Vector of 20-minute windows over the “sits” metric. For example, for the window from 12:35 to 12:55, the range vector is {1,1,1,1,1,1,1,1,1,3,3,3,3,3,3,3,3,3,3,3} (not counting the value at 12:35). What the increase function does is take the difference between the last and first values in this range vector (2, as shown in the graph).

Another function commonly used with a counter is rate. It is similar to increase, but divides the difference by the number of seconds in the window, giving a per-second rate; for the window above, this would be 2 divided by 1200, i.e. roughly 0.0017 sits per second.
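
As a back-of-the-envelope sketch of what these two functions compute on the window above (ignoring the extrapolation Prometheus applies at window boundaries):

Python
# Hypothetical per-minute samples of "sits" from 12:36 to 12:55
window = [1] * 9 + [3] * 11

increase = window[-1] - window[0]   # 3 - 1 = 2 new sits in the window
rate = increase / (20 * 60)         # 2 / 1200 ≈ 0.0017 sits per second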

Other Use Cases

API Call Count

To track the rate of API requests:

PromQL
rate(api_requests_total[5m])

Error Monitoring

To monitor how quickly errors accumulate in a system:

PromQL
rate(errors_total[5m])

2. Gauge

A Gauge is a metric that can both increase and decrease, making it perfect for measuring values that fluctuate over time, such as the current number of people sitting on a couch. In our example, a gauge lets us track how many people are sitting on the couch at any given moment.
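
A sketch of how such a gauge might be instrumented (same prometheus_client assumption as before; the metric name matches the “sitting” metric used below, while the function names are illustrative):

Python
from prometheus_client import Gauge

# Current number of people on the couch; can go up and down.
sitting = Gauge("sitting", "Number of people currently sitting on the couch")

def on_person_sits_down():
    sitting.inc()   # one more person on the couch

def on_person_stands_up():
    sitting.dec()   # one fewer person on the couch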

Observations

  • 12:33 – 1 person (A)
  • 12:45 – 3 people (A, B, C)
  • 12:51 – 1 person (A)
  • 13:03 – 2 people (A, D)
  • 13:12 – 1 person (A)
  • 13:15 – 0 people
  • 13:18 – 1 person (E)
  • 13:34 – 0 people

A handy query you can run on a gauge is avg_over_time. Calling our new metric “sitting”, we get the diagram below with avg_over_time(sitting[15m]):

The avg_over_time function averages the values of a Range Vector. Since we are using 15-minute windows this time, one example is the value at 13:15. The range vector from 13:00 to 13:15 is {1,1,2,2,2,2,2,2,2,2,2,1,1,1,0} (not counting the value at 13:00), and its average is 1.53, as seen on the diagram.
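
The same calculation as a quick sketch, using the hypothetical per-minute samples of “sitting” from 13:01 to 13:15:

Python
# Per-minute samples of "sitting" from 13:01 to 13:15
window = [1, 1] + [2] * 9 + [1, 1, 1, 0]

average = sum(window) / len(window)  # 23 / 15 ≈ 1.53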

Other Use Cases

Resource Usage

Gauge metrics are used to track CPU or memory usage in a system:

PromQL
avg(node_memory_MemAvailable_bytes)

Concurrent Connections

Track the number of active user sessions:

PromQL
max(active_sessions)

3. Histogram

Now, we want to gain some insight into how long people sit on the couch. This can give us different kinds of information, e.g. during what time of day people sit on the couch for shorter intervals, or what the average sitting duration is.

What metric should be used for this?

If we use a gauge and store the duration of a sit each time someone leaves, we lose the distribution of durations (e.g. how many times people sat on the couch for less than 5 minutes). And if we use a counter, we have no well-defined way of keeping the individual durations.

Fortunately, Histograms can come to the rescue.

A Histogram tracks the distribution of values, like the duration of an event, into predefined buckets. It’s useful for understanding how data is distributed over time.

Observations:

  • 12:51 – B and C leave after sitting for 6m each
  • 13:12 – D leaves after sitting for 9m
  • 13:15 – A leaves after sitting for 42m
  • 13:34 – E leaves after sitting for 16m

A histogram allocates the data to a set of user-defined buckets. For the observations above, we can use buckets of <7 minutes, <20 minutes, and everything above that (i.e. <∞). With these, the duration person D sat on the couch (9m) falls into the <20 minutes bucket and person A’s duration (42m) into the <∞ bucket. Also, the buckets are cumulative, which means each bucket also contains everything from the smaller buckets, so all instances fall under the <∞ bucket.

If we name the histogram metric “duration”, Prometheus creates multiple underlying metrics:

  • duration_sum: a counter summing all the durations of people sitting on the couch (79 minutes in our example)
  • duration_count: a counter of the total number of durations we have observed (5)
  • duration_bucket: for each bucket, a counter of how many instances fall into that bucket. For example, for the <7 bucket, shown as duration_bucket{le="7"}, the value is 2
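
A sketch of how this histogram could be instrumented, along with roughly what the underlying series described above would hold at the end of our observations (prometheus_client assumed; bucket boundaries chosen to match the example):

Python
from prometheus_client import Histogram

# Durations in minutes, with buckets <7 and <20 (+Inf is added automatically).
duration = Histogram(
    "duration",
    "How long people sit on the couch, in minutes",
    buckets=(7, 20),
)

for minutes in (6, 6, 9, 42, 16):
    duration.observe(minutes)

# At this point, roughly:
#   duration_bucket{le="7"}    -> 2   (the two 6-minute sits)
#   duration_bucket{le="20"}   -> 4   (cumulative: the 6, 6, 9 and 16-minute sits)
#   duration_bucket{le="+Inf"} -> 5   (everything, including the 42-minute sit)
#   duration_sum               -> 79
#   duration_count             -> 5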

Now let’s get to a query:

PromQL
histogram_quantile(0.95, rate(duration_bucket[5m]))

The inner query, rate(duration_bucket[5m]), calculates the per-second rate over the last 5 minutes (similar to the increase and rate queries we saw earlier), giving one rate per bucket. histogram_quantile(0.95, ...) then uses those per-bucket rates to estimate the value below which 95% of the observations fall, so in the end you get a single result instead of one per bucket.

If we are interested in the increase of short-interval sits, we can run an increase query on the <7 bucket:

PromQL
increase(duration_bucket{le="7"}[5m])

Or, if we want to know the average duration of people sitting on the couch in the last 5m for each point in time, we can use:

PromQL
rate(duration_sum[5m]) / rate(duration_count[5m])
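
For intuition, the same ratio over the whole observation period (a quick arithmetic sketch, not a real PromQL result) gives the overall average sit duration:

Python
# duration_sum / duration_count over the whole period
average_sit = 79 / 5   # = 15.8 minutes per sit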

Other Use Cases

A histogram metric is highly effective for tracking HTTP request durations because it provides a detailed statistical overview of the observed data over time. It can also be used for request size distributions, throughput analysis, database query performance, and any other case where the distribution of values matters.

What Now?

Metrics are a handy tool in any engineer’s toolkit. If you instrument your code well and gather the right metrics, you can prevent failures and monitor your system efficiently. You can detect trends in your users’ behavior and test your hypotheses and the validity of new features. On top of all that, you make life much easier for everyone working on the system.

Prometheus is a powerful technology that can help you keep track of your systems. Combining it with a tool like Grafana for dashboards and alerts will make you unstoppable in gaining insights into your system.

In the next entry we will focus on Logs, another pillar of observability that gives you timestamped records of the events happening in your systems, helping you take observability to the next level. Until then, peace!
