Recently, there has been a lot of discussion around OpenTracing. We’ll start this blog by introducing OpenTracing, explaining what it is and why it is gaining attention. Next, we will discuss distributed tracing system Jaeger and how it helps in troubleshooting microservices-based distributed systems. We will also set up Jaeger and learn to use it for monitoring and troubleshooting purposes.
Drift to Microservice Architecture
Microservice Architecture has now become the obvious choice for application developers. In the Microservice Architecture, a monolithic application is broken down into a group of independently deployed services. In simple words, an application is more like a collection of microservices. When we have millions of such intertwined microservices working together, it's almost impossible to map the inter-dependencies of these services and understand the execution of a request.
If a monolithic application fails then it is more feasible to do the root cause analysis and understand the path of a transaction using some logging frameworks. But in a microservice architecture, logging alone fails to deliver the complete picture.
Is this service called first in the chain? How do I span all these services to get insight into the application? With questions like these, it becomes a significantly larger problem to debug a set of interdependent distributed services in comparison to a single monolithic application, making OpenTracing more and more popular.
What is Distributed Tracing?
Distributed tracing is a method used to monitor applications, mostly those built using the microservices architecture. Distributed tracing helps to highlight what causes poor performance and where failures occur.
How OpenTracing Fits Into This?
The OpenTracing API provides a standard, vendor neutral framework for instrumentation. This means that if a developer wants to try out a different distributed tracing system, then instead of repeating the whole instrumentation process for the new distributed tracing system, the developer can easily change the configuration of Tracer.
Here are some basic terminologies of Opentracing:
Span - It represents a logical unit of work that has an operation name, the start time of the operation, and the duration.
Trace - A Trace tells the story of a transaction or workflow as it propagates through a distributed system. It is simply a set of spans sharing a TraceID. Each component in a distributed system contributes its own span.
OpenTracing is a way for services to “describe and propagate distributed traces without knowledge of the underlying OpenTracing implementation.”
Let us take the example of a service like renting a movie on any rental service like iTunes. A service like this requires many other microservices to check that the movie is available, proper payment credentials are received, and enough space exists on the viewer’s device for download. If either one of those microservice fail, then the entire transaction fails. In such a case, having logs just for the main rental service wouldn’t be very useful for debugging. However, if you were able to analyze each service you wouldn’t have to scratch your head to troubleshoot which microservice failed and what made it fail.
In real life, applications are even more complex and with the increasing complexity of applications, monitoring the applications has been a tedious task. Opentracing helps us to easily monitor:
- Spans of services
- Time taken by each service
- Latency between the services
- Hierarchy of services
- Errors or exceptions during execution of each service.
Jaeger: A Distributed Tracing System by Uber
Jaeger, is released as an open source distributed tracing system by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including:
- Distributed transaction monitoring
- Performance and latency optimization
- Root cause analysis
- Service dependency analysis
- Distributed context propagation
Major Components of Jaeger
Jaeger Client Libraries - Jaeger clients are language specific implementations of the OpenTracing API.
Agent - The Jaeger agent is a network daemon that listens for spans sent over UDP, which it batches and sends to the collector. It is designed to be deployed to all hosts as an infrastructure component. The agent abstracts the routing and discovery of the collectors away from the client.
Collector - The Jaeger collector receives traces from Jaeger agents and runs them through a processing pipeline. Currently, the pipeline validates traces, indexes them, performs transformations, and finally, stores them. Jaeger’s storage supports Elasticsearch, Cassandra and Kafka.
Query - Query is a service that gets traces from storage and hosts a UI to display them.
Ingester - Ingester is a service that reads from Kafka topic and writes to another storage backend (Cassandra, Elasticsearch).
Running Jaeger in a Docker Container
1. First, install Jaeger Client on your machine:
2. Now, let’s run Jaeger backend as an all-in-one Docker image. The image launches the Jaeger UI, collector, query, and agent:
TIP: To check if the docker container is running, use: Docker ps.
Once the container starts, open http://localhost:16686/ to access the Jaeger UI. The container runs the Jaeger backend with an in-memory store, which is initially empty, so there is not much we can do with the UI right now since the store has no traces.
Creating Traces on Jaeger UI
1. Create a Python program to create Traces:
Let’s generate some traces using a simple python program. You can clone the Jaeger-Opentracing repository given below for a sample program that is used in this blog.
The Python program takes a movie name as an argument and calls three functions that get the cinema details, movie showtime details, and finally book a movie ticket.
It creates some random delays in all the functions to make it more interesting, as in reality the functions would take certain time to get the details. Also the function throws random errors to give us a feel of how the traces of a real-life application may look like in case of failures.
Here is a brief description of how OpenTracing has been used in the program:
- Initializing a tracer:
- Using the tracer instance:
- Starting new child spans using start_span:
- Using Tags:
- Using Logs:
2. Run the python program:
Now, check your Jaeger UI, you can see a new service "booking" added. Select the service and click on "Find Traces" to see the traces of your service. Every time you run the program a new trace will be created.
You can now compare the duration of traces through the graph shown above. You can also filter traces using “Tags” section under “Find Traces”. For example, Setting “error=true” tag will filter out all the jobs that have errors, as shown:
To view the detailed trace, you can select a specific trace instance and check details like the time taken by each service, errors during execution and logs.
The above trace instance has four spans, the first representing the root span “booking”, the second is the “CheckCinema”, the third is the “CheckShowtime” and last is the “BookShow”. In this particular trace instance, both the “CheckCinema” and “CheckShowtime” have reported errors, marked by the error=true tag.
In this blog, we’ve described the importance and benefits of OpenTracing, one of the core pillars of modern applications. We also explored how distributed tracer Jaeger collect and store traces while revealing inefficient portions of our applications. It is fully compatible with OpenTracing API and has a number of clients for different programming languages including Java, Go, Node.js, Python, PHP, and more.