OpenTelemetry: The Beginner’s Guide to Modern Observability

Arslan Ahmad
Understand how data flows in modern architectures. A step-by-step technical explanation of OpenTelemetry, Collectors, and Context Propagation.
On this page

The Evolution from Monitoring to Observability

The Three Pillars of Observability

1. Logs

2. Metrics

3. Traces

The Problem OpenTelemetry Solves

How OpenTelemetry Works: The Architecture

1. Instrumentation

2. The OpenTelemetry Collector

3. Exporters

Context Propagation

Why This Matters for System Design Interviews

The Role of Sampling

Conclusion

Modern software architectures have evolved from single, massive programs into complex networks of smaller, interconnected services.

This shift toward microservices introduces a level of operational difficulty that makes traditional debugging methods ineffective. When a software system spans hundreds of servers and involves different programming languages, finding the exact location of a failure becomes a significant engineering challenge.

The industry solution to this problem is a concept known as Observability, and the global standard for implementing it is OpenTelemetry.

The Evolution from Monitoring to Observability

In the past, software usually ran as a monolith. This means the user interface, the business logic, and the database connections all lived inside one executable file running on one server.

If the application crashed or slowed down, an engineer could simply open the central log file on that specific server to find the error. The scope of the problem was limited to a single machine.

Cloud computing and microservices have fundamentally changed this landscape.

Now, a single user action might trigger a chain reaction across dozens of different services.

One service might handle authentication, another might process payments, and a third might update inventory.

If the system becomes slow, looking at the health of a single server is not enough. The server might report that it is healthy and running, yet the user still experiences a failure.

This distinction leads to the important difference between Monitoring and Observability.

Monitoring tells you the overall state of the system. It answers questions like "Is the server on?" or "Is the database responding?" or "Is disk space low?" It acts as a status check.

Observability relies on the external outputs of the system to explain its internal state. It allows you to answer "Why is the system slow?" or "Why did this specific request fail?" It provides the granular details required to debug complex interactions that were not predicted in advance.

To achieve high observability, engineers need to collect three specific types of data. These are widely known as the Three Pillars of Observability.

The Three Pillars of Observability

OpenTelemetry is designed to collect and unify these three distinct types of data. Understanding the role of each pillar is required to understand how the system works as a whole.

1. Logs

Logs are the most fundamental form of recorded data.

A log is a timestamped text record of a discrete event that happened within the code.

When a developer writes code, they often insert print statements or logging commands to output information during execution.

In a production environment, these outputs are structured and saved.

A log entry typically contains a timestamp, a severity level (like INFO, WARN, or ERROR), and a message describing what happened.
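
As a quick illustration, here is what producing such entries looks like with Python's standard logging module (the service name and messages below are made up for the example):

```python
import logging

# Emit log lines with a timestamp, a severity level, and a message.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("payment-service")

logger.info("Payment accepted for order_id=8841")
logger.warning("Retrying card authorization, attempt=2")
logger.error("Database connection refused: authentication failed")
```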

Why Logs Are Critical: Logs are excellent for high-resolution debugging. They provide the exact details of a specific failure. If a database connection is refused due to a bad password, the log will contain that specific error message.

The Limitation of Logs: Logs are expensive to store because they generate massive amounts of text data. In a high-traffic system, generating millions of lines of text per minute makes it difficult to find the signal in the noise. Reading a million lines of text to find a pattern is inefficient and slow.

2. Metrics

Metrics are numerical representations of data measured over intervals of time.

Unlike logs, which record specific text events, metrics record numbers and statistics.

Common examples of metrics include:

  • CPU Usage: What percentage of processing power is being used?
  • Memory Consumption: How much RAM is free?
  • Request Count: How many users hit the API in the last minute?
  • Latency: What is the average time it takes to serve a request?
  • Error Rate: What percentage of requests resulted in a failure?
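
To make a couple of the metrics above concrete, here is a minimal sketch using the OpenTelemetry Python API (the meter name, metric names, and attributes are illustrative; without an SDK configured, these calls are harmless no-ops):

```python
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")

# A counter for request volume and a histogram for latency.
request_counter = meter.create_counter(
    "http.server.request_count",
    description="Number of HTTP requests received",
)
latency_histogram = meter.create_histogram(
    "http.server.duration",
    unit="ms",
    description="Time taken to serve a request",
)

# Record one request that took 42 ms.
request_counter.add(1, {"http.route": "/login", "http.status_code": 200})
latency_histogram.record(42, {"http.route": "/login"})
```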

Why Metrics Are Critical: Metrics are highly efficient for storage. Storing a single number that represents the average latency over a minute takes up very little space compared to storing a log file for every request in that minute.

Metrics are perfect for triggering alerts.

If CPU usage goes above 90%, the system can automatically notify an engineer.

The Limitation of Metrics: Metrics lack context. A metric can tell you that the error rate spiked to 50%, but it cannot tell you which specific user caused the error or why it happened. It provides the "what" but not the "why."


3. Traces

Traces are the connective tissue that ties a distributed system together.

A trace represents the entire journey of a single request as it moves through a distributed system.

When a request enters the system, it is assigned a unique identifier. As that request moves from Service A to Service B and then to the Database, that identifier moves with it. This allows engineers to visualize the full path.

A trace is made up of smaller units called Spans.

A span represents a single operation within the trace, such as a database query or an HTTP call. Each span records how long that specific operation took and whether it succeeded or failed.
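
As a rough sketch of these concepts, the snippet below uses the OpenTelemetry Python SDK to create a trace with a parent span and a nested child span, printing the finished spans to the console (the service and span names are invented, and it assumes the opentelemetry-sdk package is installed):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal SDK setup: print each finished span to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

# One trace, two spans: the overall operation and a nested database call.
with tracer.start_as_current_span("handle_checkout") as parent:
    parent.set_attribute("order.id", "ord-1234")
    with tracer.start_as_current_span("db.query") as child:
        child.set_attribute("db.statement", "SELECT * FROM orders WHERE id = ?")
```

Both spans share the same Trace ID, which is what lets a tracing backend reassemble them into a single timeline.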


Why Traces Are Critical: Traces allow engineers to pinpoint bottlenecks in a complex chain.

If a login takes five seconds, a trace can show that 4.9 seconds were spent waiting for the legacy database, while the rest of the application was fast.

Without traces, you would have to guess which service was causing the delay.

The Problem OpenTelemetry Solves

Before OpenTelemetry, collecting this data was difficult. There were many different vendors and tools, and each had its own way of describing data.

One tool might call a metric "cpu_load" while another called it "cpu_usage."

A vendor's tracing library might support Java but offer nothing comparable for Go, and the data it produced only worked with that vendor's backend. This fragmentation created Vendor Lock-in.

If a company spent months integrating a specific monitoring tool, switching to a better tool later was nearly impossible because they would have to rewrite all their data collection code.

OpenTelemetry (OTel) acts as a standardized translation layer. It is an open-source framework that provides a single set of APIs, libraries, and agents to collect logs, metrics, and traces.

With OpenTelemetry, developers instrument their application once using the OTel standard. The data can then be sent to any backend tool they choose.

If the team decides to switch analysis tools next year, they do not need to change a single line of application code. They only need to change the configuration of where OpenTelemetry sends the data.

This decoupling of data generation (the code) from data analysis (the tool) is the primary value proposition of OpenTelemetry.

How OpenTelemetry Works: The Architecture

The architecture of OpenTelemetry is designed to be flexible and modular. It consists of three main stages: Instrumentation, Collection, and Exporting.

1. Instrumentation

Instrumentation is the process of adding code to an application to generate telemetry data. OpenTelemetry offers two primary methods for this.

Automatic Instrumentation: This is the most popular method for beginners. OpenTelemetry provides agents that can attach to a running application without modifying the source code. For languages like Java, Python, and Node.js, the agent can automatically inject itself into standard libraries. It can detect when an HTTP request is received or a database query is made and generate traces and metrics automatically. This allows teams to get visibility immediately with zero code changes.

Manual Instrumentation: For more complex needs, developers import the OpenTelemetry API into their code. They can manually start spans, add custom attributes to logs, or record specific business metrics (like "number of items added to cart"). This offers granular control over exactly what data is recorded and allows for tracking business logic that an automatic agent would not understand.
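
For example, a hypothetical shopping-cart service could wrap its business logic in a custom span and record a business metric roughly like this (all names here are invented for the sketch):

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("cart-service")
meter = metrics.get_meter("cart-service")

items_added = meter.create_counter(
    "cart.items_added",
    description="Number of items added to shopping carts",
)

def add_to_cart(item_id: str, quantity: int) -> None:
    # A custom span around the business operation, with business attributes.
    with tracer.start_as_current_span("cart.add_item") as span:
        span.set_attribute("cart.item_id", item_id)
        span.set_attribute("cart.quantity", quantity)
        items_added.add(quantity)
        # ... actual business logic would go here ...

add_to_cart("sku-42", 3)
```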

2. The OpenTelemetry Collector

The Collector is a crucial component that acts as a middleman between the application and the backend analysis tools. It is a vendor-agnostic proxy that receives, processes, and exports telemetry data.

The Collector performs three specific functions in a pipeline:

  • Receivers: The receiver component listens for incoming data from the applications. It supports multiple protocols, ensuring it can accept data from various sources.

  • Processors: The processor component modifies the data before sending it out. This is a powerful feature for data governance. For example, a team can use a processor to filter out sensitive data (like removing credit card numbers or PII from logs) before the data ever leaves the infrastructure. It can also aggregate metrics to reduce costs.

  • Exporters: The exporter component sends the processed data to the final destination.

Using a Collector is considered a best practice because it offloads the work from the application.

The application simply fires the data to the local Collector and goes back to processing user requests.

The Collector handles the heavy lifting of formatting, retrying, and sending that data over the network.
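
From the application's side, "firing the data to the local Collector" usually means pointing an OTLP exporter at the Collector's endpoint, roughly like this (a sketch assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages, with a Collector listening on its default gRPC port 4317):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Batch spans in memory and ship them to the local Collector over OTLP/gRPC.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```

Everything after that point (which backend the data ends up in, how it is filtered or batched) is decided in the Collector's configuration rather than in application code.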


3. Exporters

The exporter is the final link in the chain. It translates the internal OpenTelemetry data format into the specific format required by the backend analysis tool.

For example, if a team uses Prometheus for metrics and Jaeger for tracing, the OpenTelemetry Collector will use a Prometheus exporter and a Jaeger exporter. The application code effectively does not know or care which tools are used for the final display. This makes the architecture highly modular and future-proof.

Context Propagation

To understand how distributed tracing actually works, one must understand Context Propagation. This is the mechanism that ensures a trace stays connected across different servers.

When Service A calls Service B, it cannot just send the normal request payload. It must also inject metadata about the trace that is currently in flight. This metadata is called the Trace Context.

The Trace Context typically consists of:

  • Trace ID: A unique global ID for the entire request chain.

  • Span ID: The ID of the specific operation in Service A that triggered the call.

  • Trace Flags: Options such as whether this request should be sampled (recorded) or ignored.

This context is passed via HTTP headers (often following the W3C Trace Context standard).

When Service B receives the request, it extracts these headers. It sees the Trace ID and understands that its work is part of a larger parent operation. It creates a new Span ID for its own work but links it to the parent Span ID from Service A.

If this propagation fails, the trace is broken.

The system would see two disconnected traces rather than one unified story.

OpenTelemetry libraries handle this injection and extraction automatically for standard protocols, ensuring the link remains intact without manual developer intervention.
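
To make that mechanism concrete, here is roughly what the injection and extraction look like if done by hand with the OpenTelemetry Python API (the service names are invented; in practice the HTTP client and server instrumentation perform these steps for you):

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

# A real TracerProvider so spans carry valid Trace and Span IDs.
trace.set_tracer_provider(TracerProvider())

# --- Service A: outgoing side ---
tracer_a = trace.get_tracer("service-a")
headers: dict = {}
with tracer_a.start_as_current_span("call_service_b"):
    # Adds the W3C "traceparent" header, e.g. "00-<trace-id>-<span-id>-01".
    inject(headers)

# --- Service B: incoming side (normally a separate process) ---
def handle_request(incoming_headers: dict) -> None:
    ctx = extract(incoming_headers)  # rebuild the parent context from headers
    tracer_b = trace.get_tracer("service-b")
    # This span is created as a child of Service A's span, so both share
    # one Trace ID and show up as a single connected trace.
    with tracer_b.start_as_current_span("process_work", context=ctx):
        pass  # business logic

handle_request(headers)
```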

Why This Matters for System Design Interviews

For candidates preparing for System Design Interviews (SDI), adding observability to a design is often the difference between a mid-level and a senior-level answer.

When designing a system like "Design Instagram" or "Design a Ride-Sharing App," interviewers will inevitably ask how the system handles failure at scale.

Simply saying "we will check the logs" is an insufficient answer for a large-scale distributed system.

A strong candidate will explain:

  • How they will use Metrics to monitor the overall health (latency, traffic, errors, saturation).

  • How they will use Tracing to debug slow requests across microservices.

  • How they will use Sampling to avoid recording every single request, which saves money and storage.

Understanding OpenTelemetry demonstrates that the candidate knows how to maintain the system after it is built. It shows an awareness of operational complexity and the realities of running code in production.

The Role of Sampling

In high-traffic systems, recording a trace for every single request is too expensive and resource-intensive.

If a system processes one million requests per second, storing full trace data for all of them would require massive storage infrastructure.

This is where Sampling comes in.

Sampling is the practice of recording only a subset of traces.

Head-Based Sampling: The decision to record the trace is made at the very beginning, when the request hits the first service. The system might decide to record 1% of all traffic. This is simple and efficient because if the first service drops the trace, all downstream services know to drop it as well, saving resources.
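
A sketch of what head-based sampling looks like in the OpenTelemetry Python SDK, keeping roughly 1% of traces (the ratio and setup are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~1% of new traces at the root; ParentBased makes downstream
# services honor whatever decision the first service already made.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Because the decision travels with the Trace Context flags, every downstream service keeps or drops the same trace consistently.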

Tail-Based Sampling: The decision is made at the end of the request. This is more powerful but harder to implement. It allows the system to say, "Discard all the successful traces, but keep 100% of the traces that resulted in an error or took longer than two seconds." This ensures that the engineering team always has data on the problematic requests, which are the ones that actually matter for debugging.

Conclusion

OpenTelemetry has emerged as the industry standard for observability because it unifies the fractured landscape of logging, metrics, and tracing.

By providing a vendor-neutral framework, it allows engineering teams to focus on the quality of their code rather than the specifics of their monitoring tools.

Key Takeaways:

  • Observability is about understanding the internal state of a system based on its external outputs, whereas Monitoring only checks the system health.
  • The Three Pillars are Logs (events), Metrics (aggregates), and Traces (request journeys).
  • OpenTelemetry standardizes how this data is collected, ensuring you are not locked into a single vendor.
  • The Collector sits between your code and your analysis tools, processing data efficiently.
  • Context Propagation is the mechanism that links requests across different microservices using Trace IDs.
  • Instrumentation can be automatic (zero-code) or manual (custom logic) depending on the depth of data required.

Mastering these concepts provides the foundation for building reliable, scalable systems that can be maintained and debugged with confidence.

As you grow in your engineering career, the ability to see inside the "black box" of your application will be one of your most valuable skills.
