In this article, intern engineers on the open source observability team, Karen Xu and Mark Seufert, describe their design and implementation of an open source prototype Logging API and SDK for the OpenTelemetry C++ library. This library allows users to write and collect logs from their distributed and instrumented applications. This article details the goals of the project, the design of these components, and the authors’ experience working on the open source project, including challenges and lessons learned along the way.
OpenTelemetry is an open source project developed under the Cloud Native Computing Foundation (CNCF). It provides a set of tools and libraries for developers to collect and correlate telemetry information from their distributed applications. OpenTelemetry’s main goal is to support the collection of the three pillars of observability: traces, metrics, and logs. Currently, OpenTelemetry has developed the tracing and metric specification and implementations, but the logging specification and implementation are still in development.
The goal of this project was to prototype the logging functionality for the C++ Language Library, which allows users to write observability-compliant logs for their applications. These logs can get sent to various backend locations, such as Elasticsearch or Loki. Because custom logging wasn’t implemented for any other OpenTelemetry language before this point, our project also helped develop the logging specification, which defines the standard for how other languages would implement custom logging.
We researched the OpenTelemetry specification for guidance on how the logging functionality should be implemented. We found that OpenTelemetry had not yet developed a clearly defined logging specification, unlike the ones that already existed for traces and metrics. While working on this prototype for the OpenTelemetry C++ library, we therefore followed the tracing specification, which has substantial overlap with logging (and this prototype would, in turn, inform the in-progress logging specification). Underlying logic specific to tracing, such as parts of the API, was discarded in our implementation. We also identified many enhancements specific to logging (e.g., having named Loggers, similar to other logging libraries), which we filed as issues to be added to the specification later.
Although the OpenTelemetry Logging specification had not been fully developed, it had a Log Data Model, which defined how logs could be represented in an OpenTelemetry Collector compliant manner.
Log Data Model
OpenTelemetry’s current main source of logging representation is the Log Data Model. This model is an internal representation of a log record that is compliant with the OpenTelemetry Collector. The Log Data Model contains 10 fields, which can be written to by the Logging API and SDK to represent the log from the original source. All 10 fields, along with a brief description, are provided in the following table:
| Field | Description |
|-------|-------------|
| Timestamp | Time when the event occurred. |
| TraceId | Request trace identification. |
| SpanId | Request span identification. |
| TraceFlags | W3C trace flag. |
| SeverityText | The severity text (also known as log level). |
| SeverityNumber | Numerical value of the severity. |
| Name | Short event identifier. |
| Body | The body of the log record. |
| Resource | Describes the source of the log. |
| Attributes | Additional information about the event. |
The objective of introducing the logging pipeline for OpenTelemetry is to allow logs to be correlated to other sources of telemetry in the collector. Thus, the TraceId, SpanId, and TraceFlags can be added as a part of logs, which would allow traces and logs to be correlated.
Other fields, such as the severity text and number, allow less-important logs to be filtered out because an application could generate thousands of detailed and verbose log messages, not all of which could be useful to the user.
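Because the severity number is ordinal, a processor or backend can drop records below a chosen threshold. The sketch below illustrates the idea; the struct and function are hypothetical, not part of the OpenTelemetry API, and the numeric values loosely follow the Log Data Model's severity numbering (DEBUG = 5, INFO = 9, ERROR = 17).

```cpp
#include <string>
#include <vector>

// Toy log entry: just the two fields needed to show severity filtering.
struct LogEntry {
  int severity_number;
  std::string body;
};

// Keep only records at or above the given severity threshold.
std::vector<LogEntry> FilterBySeverity(const std::vector<LogEntry> &logs,
                                       int min_severity) {
  std::vector<LogEntry> kept;
  for (const auto &entry : logs) {
    if (entry.severity_number >= min_severity) kept.push_back(entry);
  }
  return kept;
}
```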
The last four fields are the name, body, resource, and attributes. These are the fields that the user actually populates when writing a log. The name and body are string fields, so you could store the text "Hello world" in them, while the resource and attributes fields are key-value pairs, which allow you to store a map or a vector.
OpenTelemetry logging overview
Initial evaluation of existing C++ logging libraries
Several OpenTelemetry language repositories already base their logging implementations on the APIs of logging libraries that are built in or commonly used in that language. For example, Java uses the well-known Log4j, and .NET uses the built-in ILogger.
To determine whether our C++ logging implementation could leverage the API of an existing logging library, which would simplify our design, we evaluated a few of the major existing C++ libraries.
Our major requirements for an existing library included an open source license, a reliable and reputable vendor, a significant number of users, and support for structured logging. The following table shows our evaluation of eight of the most promising C++ logging libraries:
| # | Library | Reason for elimination |
|---|---------|------------------------|
| 1 | Log4Cxx | No structured logging |
| 2 | Log4Cpp | No recent development |
| 4 | Spdlog | No structured logging |
| 6 | EasyLogging++ | No structured logging |
| 7 | Quill | Very new and small number of users |
| 8 | Blackhole | Small number of users |
After evaluating the above C++ logging libraries, we found that none met all of our requirements. This meant that we wouldn’t be able to leverage an existing logging library and would instead need to develop a custom C++ OpenTelemetry Logging API. We then moved on to scoping out our design goals for the project, taking into account specifics to OpenTelemetry, industry logging standards, and general modular design best practices.
Logging library design
Similar to the other two pillars of observability, implementing the logging pipeline required us to develop an API, an SDK, and exporters. The API is used to capture log data from users, the SDK to configure and process that data, and the exporters to send the log data to a destination. This design is shown in the following diagram:
Having done the initial research and settled on the main tenets of our design, we drafted an initial UML design of our logging prototype.
The classes and methods that we designed for the Logging API and SDK involved the following UML components. We implemented a generic exporter interface that will be implemented by each individual exporter. The API components are in blue, the SDK in green, the no-operation (NOOP) classes in yellow, and the data structures and enums in orange.
The purpose of an API is to provide the user with the classes and functions required to use the library and no more. Our goal was to keep the API simple and minimal. Adding functionality to the API later is easy, but it’s poor practice to make breaking changes by removing functionality. We decided on three main classes: the Logger, the LoggerProvider, and the Provider.
The Logger class is the foundation of our logging library. It contains the typical `Log("Hello World!")` methods that a person thinks of when they imagine a logging statement. In addition to these basic logging calls, which are expected in any logging library, our API also contains one comprehensive method that accepts every field of the Log Data Model.
This method contains all the fields of the Log Data Model and allows users to write data that map directly into it.
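A sketch of what this comprehensive method might look like is shown below. It is a simplification: std containers stand in for the nostd and key-value iterable types the real API uses, the parameter names and order are assumptions, and the severity values loosely follow the Log Data Model's numbering.

```cpp
#include <chrono>
#include <map>
#include <string>

// Severity values loosely follow the Log Data Model's severity numbers.
enum class Severity {
  kTrace = 1, kDebug = 5, kInfo = 9, kWarn = 13, kError = 17, kFatal = 21
};

class Logger {
public:
  virtual ~Logger() = default;

  // One parameter per writable Log Data Model field. Trace/span identifiers
  // and the timestamp can be injected by the SDK when left unset.
  virtual void Log(Severity severity,
                   const std::string &name,
                   const std::string &body,
                   const std::map<std::string, std::string> &resource,
                   const std::map<std::string, std::string> &attributes,
                   const std::string &trace_id,
                   const std::string &span_id,
                   const std::string &trace_flags,
                   std::chrono::system_clock::time_point timestamp) = 0;
};
```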
The second class we defined is the LoggerProvider, which keeps track of the Logger instances that have been created via its `GetLogger(name)` method. We encourage users to create logger instances via this method, instead of directly instantiating them, because storing all the loggers in a centralized location (the LoggerProvider) makes them easier to keep track of. Additionally, it simplifies the design of our SDK.
The third class we defined is the Provider, which is used to save a singleton reference of a LoggerProvider. It offers two static methods to Get and Set this singleton instance. We defined it this way to make it easy for an SDK implementation to be pushed into the API.
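A minimal sketch of that singleton pattern follows. It is simplified: the real API wraps the global in nostd types and guards access for thread safety, which this sketch omits.

```cpp
#include <memory>
#include <utility>

// Placeholder for the API LoggerProvider described above.
class LoggerProvider { /* owns the loggers and the processor */ };

// Static holder for the process-wide LoggerProvider singleton.
class Provider {
public:
  static std::shared_ptr<LoggerProvider> GetLoggerProvider() {
    return provider_;
  }
  // The SDK pushes its own LoggerProvider implementation in through here.
  static void SetLoggerProvider(std::shared_ptr<LoggerProvider> provider) {
    provider_ = std::move(provider);
  }

private:
  static std::shared_ptr<LoggerProvider> provider_;
};

// Default instance until an SDK implementation is set.
std::shared_ptr<LoggerProvider> Provider::provider_ =
    std::make_shared<LoggerProvider>();
```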
In addition to these main classes, we also created no-operation classes that define a minimal implementation of the API. This approach ensures that if a user includes the API without an SDK implementation, the API will still work and not throw any errors or exceptions.
Sample Logger API calls
Using the main logging method that we created in the API, which supports logging all the Log Data Model fields, the example below shows a user writing a log with some of the fields populated with their data:
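To illustrate the call shape, here is a toy stand-in for the API Logger; the class and signatures are illustrative, not the actual OpenTelemetry C++ API, which uses nostd and key-value iterable types.

```cpp
#include <map>
#include <string>

enum class Severity { kTrace, kDebug, kInfo, kWarn, kError, kFatal };

// Toy logger that records the last log call so it can be inspected.
struct ToyLogger {
  Severity last_severity{};
  std::string last_name;
  std::string last_body;
  std::map<std::string, std::string> last_attributes;

  // Only some Log Data Model fields are accepted here; unspecified fields
  // would be defaulted or injected by the SDK.
  void Log(Severity severity, const std::string &name,
           const std::string &body,
           const std::map<std::string, std::string> &attributes) {
    last_severity = severity;
    last_name = name;
    last_body = body;
    last_attributes = attributes;
  }
};
```

A user writing a log with some fields populated would then call, for example, `logger.Log(Severity::kInfo, "user-login", "Hello, World!", {{"component", "checkout"}});`.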
To offer more convenience, the Logging API has overloads that are typical in other logging libraries, which wrap into the general log method from above. Several examples are shown in the following:
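A sketch of how such overloads can funnel into one general method is shown below; the severity-named wrappers mirror common logging libraries and are assumptions, not the exact OpenTelemetry API.

```cpp
#include <string>

enum class Severity { kDebug, kInfo, kWarn, kError };

// Toy logger whose severity-specific wrappers all forward to one Log method.
class SimpleLogger {
public:
  std::string last_line;  // kept for inspection

  void Log(Severity severity, const std::string &body) {
    static const char *names[] = {"DEBUG", "INFO", "WARN", "ERROR"};
    last_line = std::string(names[static_cast<int>(severity)]) + ": " + body;
  }

  // Convenience overloads that wrap into the general Log() call.
  void Debug(const std::string &body) { Log(Severity::kDebug, body); }
  void Info(const std::string &body) { Log(Severity::kInfo, body); }
  void Warn(const std::string &body) { Log(Severity::kWarn, body); }
  void Error(const std::string &body) { Log(Severity::kError, body); }
};
```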
The purpose of the Logging SDK is to implement the back-end logic for the API calls and send the logs out to a destination. The SDK is responsible for implementing three main functionalities to achieve this goal:
- Implementing the API calls: The classes defined in the API require an implementation inside the SDK in order to have logic (instead of a no-operation).
- Processing logs: Processing the log data into batches, which get sent to a target after a buffer limit is met or time limit is reached.
- Exporting logs: Sending the processed logs to one or more configured exporters.
The classes and function calls required to implement this functionality are shown in the following data pipeline.
SDK data path diagram
The sequence diagram for the SDK Logging pipeline is shown here:
Step 1: Setup
- Start at the LoggerProvider.
- Get an instance of it using the Provider's static getter.
- The user calls `SetProcessor(processor)` to attach a Processor, which they define and which is attached to an Exporter.
- The user gets a Logger instance from the `GetLogger(name)` method defined in the API.
Step 2: Creating a log
- The user calls a `Log()` method from the Logger.
- `Logger::Log()` calls the processor's `MakeRecordable()` to get a Recordable.
- The processor calls `Exporter::MakeRecordable()` to get a Recordable, which it passes back to the original `Logger::Log()` that called it.
Step 3: Processing the log
- To push the data into the processor, the `Log()` method calls the processor directly with the filled Recordable.
Step 4: Exporting the log
- The Processor calls the exporter's `Export()` method.
- The exporter returns an `ExportResult` based on whether the logs were written successfully. The processor can decide what to do with this result (e.g., simply keeping count of successful vs. dropped logs).
The SDK will provide definitions for both the API LoggerProvider and Logger classes. The SDK LoggerProvider will contain a list of the loggers it created from the `GetLogger(name)` method and introduce a `SetProcessor(processor)` method for attaching a processor.
For the SDK Logger, the SDK will implement the logic behind the `Log(severity, name, body, ...)` method. The data will be stored inside a recordable object and then sent to the LoggerProvider's processor to be batched and exported. Additionally, the logger will inject the timestamp, severity, trace identification, span identification, and trace flags into the recordable if none are specified by the user.
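That injection logic can be sketched as follows; the record type and the `InjectDefaults` helper are hypothetical stand-ins, not the prototype's actual code.

```cpp
#include <chrono>

// Toy record: a default-constructed timestamp (the epoch) and a severity of
// zero are treated as "unset by the user".
struct ToyRecord {
  std::chrono::system_clock::time_point timestamp{};
  int severity_number = 0;
};

// Fill in any fields the user left unset. Trace/span identification and
// trace flags would be pulled from the active span context the same way.
void InjectDefaults(ToyRecord &record) {
  if (record.timestamp == std::chrono::system_clock::time_point{}) {
    record.timestamp = std::chrono::system_clock::now();
  }
  if (record.severity_number == 0) {
    record.severity_number = 9;  // default to INFO-level severity
  }
}
```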
Instead of having the SDK create a log record struct or class that is in turn passed to or shared by all the processors/exporters, a `Recordable` data type will be implemented by exporters and passed back to the SDK for data injection. This approach allows exporters to provide custom internal implementations of a Recordable data structure and to use custom data format types.

The `Recordable` interface will have 10 public methods that can be called by the SDK: a setter for each of the 10 Log Data Model fields. These setters will be called by the Logger SDK implementation to inject the user's log data.
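A sketch of that interface is shown below, with simplified signatures; the real interface uses nostd and trace types, and the exact setter names here are assumptions based on the Log Data Model fields.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: one pure-virtual setter per Log Data Model field,
// implemented by each exporter's Recordable.
class Recordable {
public:
  virtual ~Recordable() = default;
  virtual void SetTimestamp(std::int64_t nanos_since_epoch) = 0;
  virtual void SetTraceId(const std::string &trace_id) = 0;
  virtual void SetSpanId(const std::string &span_id) = 0;
  virtual void SetTraceFlags(std::uint8_t trace_flags) = 0;
  virtual void SetSeverityText(const std::string &severity_text) = 0;
  virtual void SetSeverityNumber(int severity_number) = 0;
  virtual void SetName(const std::string &name) = 0;
  virtual void SetBody(const std::string &body) = 0;
  virtual void SetResource(const std::string &key,
                           const std::string &value) = 0;
  virtual void SetAttribute(const std::string &key,
                            const std::string &value) = 0;
};
```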
The SDK will provide a default implementation of the Recordable interface: a `LogRecord` class. It will store the fields of the Log Data Model in private member variables, whose default values are set by the constructor. The class will implement the public `Set*()` methods of the Recordable interface, as well as provide additional `Get*()` methods for each field.
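A simplified LogRecord along those lines is sketched below; only a few of the 10 fields are shown, and unlike the prototype's actual class, it does not derive from the Recordable interface, to keep the example self-contained.

```cpp
#include <map>
#include <string>

// Simplified LogRecord: Log Data Model fields held in private members,
// default-initialized, with Set*/Get* accessors.
class LogRecord {
public:
  void SetName(const std::string &name) { name_ = name; }
  void SetBody(const std::string &body) { body_ = body; }
  void SetAttribute(const std::string &key, const std::string &value) {
    attributes_[key] = value;
  }

  const std::string &GetName() const { return name_; }
  const std::string &GetBody() const { return body_; }
  const std::map<std::string, std::string> &GetAttributes() const {
    return attributes_;
  }

private:
  std::string name_;  // empty by default
  std::string body_;
  std::map<std::string, std::string> attributes_;
};
```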
Processor (simple and batch)
When a LogRecord is received by the Logger, it is sent to a processor. We defined two types of log processors: a simple processor and a batch processor. The simple processor sends a log record to an exporter as soon as the user creates it. The batch processor, meanwhile, stores the logs it receives in a circular buffer until it is time to create and send a batch of logs to the exporter. The batch processor takes three configuration options as parameters: the maximum size of the buffer that stores the LogRecords, the maximum batch size per export call, and the delay interval between two consecutive exports.
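The three knobs can be sketched as a small options struct. The names and default values below are assumptions modeled on the tracing SDK's batch span processor, not the prototype's exact values.

```cpp
#include <chrono>
#include <cstddef>

// Configuration knobs for a batch log processor (illustrative defaults).
struct BatchLogProcessorOptions {
  std::size_t max_queue_size = 2048;        // circular-buffer capacity
  std::size_t max_export_batch_size = 512;  // records per Export() call
  std::chrono::milliseconds schedule_delay{5000};  // delay between exports
};

// A batch is cut when either the size threshold is reached or the
// scheduled delay has elapsed since the last export.
bool ShouldExport(std::size_t buffered,
                  std::chrono::milliseconds since_last_export,
                  const BatchLogProcessorOptions &opts) {
  return buffered >= opts.max_export_batch_size ||
         since_last_export >= opts.schedule_delay;
}
```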
The last step for our SDK was creating an exporter interface. The interface defines all the methods that an exporter needs, but does not implement them itself. A user who wants to create a custom exporter for the logging library simply extends this interface and implements each of the methods as desired.
This interface consists of two main methods. `Export()` takes the batch of LogRecords sent by the processor and implements the custom logic required to format and send the logs to the specific export destination. For example, exporting to Elasticsearch involves formatting the logs as JSON and sending them to Elasticsearch over its HTTP REST API. The `Export()` method then returns one of two responses: an export success, if the logs were successfully sent, or an export failure, if the logs were dropped in the process.
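A sketch of the interface plus a trivial implementation follows. The `Shutdown()` method and the exact signatures are assumptions modeled on the tracing exporter interface, not the prototype's verbatim API, and LogRecord is reduced to an empty struct.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

enum class ExportResult { kSuccess, kFailure };

struct LogRecord { /* Log Data Model fields would live here */ };

// Interface every concrete exporter extends.
class LogExporter {
public:
  virtual ~LogExporter() = default;
  // Format and send a batch of records to the export destination.
  virtual ExportResult Export(
      const std::vector<std::unique_ptr<LogRecord>> &batch) = 0;
  virtual bool Shutdown() = 0;
};

// Trivial implementation: counts records instead of sending them anywhere.
class CountingExporter : public LogExporter {
public:
  std::size_t exported = 0;
  ExportResult Export(
      const std::vector<std::unique_ptr<LogRecord>> &batch) override {
    exported += batch.size();
    return ExportResult::kSuccess;
  }
  bool Shutdown() override { return true; }
};
```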
The Logging API had an additional OpenTelemetry requirement: it had to maintain ABI stability. This means that it cannot depend on any library that is not defined inside the `opentelemetry/include` folder; instead of std classes, the API must use nostd classes.
Other details that we considered to improve our code's maintainability and to conform to the C++ repository guidelines included error handling, concurrency, and coding style. For error handling, because the C++ repository had not yet developed a way of surfacing SDK errors back to the user, we had to ensure that no exceptions or errors were thrown from our code. For concurrency, the specification listed some methods that would not be called concurrently, but we otherwise had to operate under the general assumption that the code would be run by multiple cores or processors, so all functions had to be thread safe. For code readability, we followed the Google C++ Style Guide, which is used throughout the repository.
We tested the components and the entire logging pipeline using both unit and integration tests. We unit tested each method and edge case using GoogleTest and achieved approximately 95% code coverage, as measured by codecov.io.
Furthermore, we tested the complete pipeline using concrete implementations of both exporters (OStream and Elasticsearch) and both processors (simple and batch) to ensure that the SDK as a whole functioned properly. This integration test follows the same pseudocode as the example from above. The GIF below shows a log with attributes being exported to Elasticsearch and viewed through Kibana.
Usually, a great way to understand a concept is through an example. The code snippet shown below shows the pseudocode of how a user would initialize the logging pipeline and write a simple log in C++.
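Since the original snippet is not reproduced here, the following is a self-contained toy version of that pipeline. Every class below is an illustrative stand-in for the corresponding OpenTelemetry C++ type (a "record" is reduced to a plain string), wired together in the order the article describes: exporter, then processor, then provider, then logger.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <utility>

// Exporter: prints each record and keeps the last one for inspection.
struct OStreamExporter {
  std::string last_exported;
  void Export(const std::string &record) {
    last_exported = record;
    std::cout << record << std::endl;
  }
};

// Simple processor: forwards each record to the exporter immediately.
struct SimpleProcessor {
  explicit SimpleProcessor(std::shared_ptr<OStreamExporter> exporter)
      : exporter_(std::move(exporter)) {}
  void OnReceive(const std::string &record) { exporter_->Export(record); }
  std::shared_ptr<OStreamExporter> exporter_;
};

// Logger: hands the user's log body to the processor.
struct Logger {
  explicit Logger(std::shared_ptr<SimpleProcessor> processor)
      : processor_(std::move(processor)) {}
  void Log(const std::string &body) { processor_->OnReceive(body); }
  std::shared_ptr<SimpleProcessor> processor_;
};

// Provider: holds the processor and vends loggers attached to it.
struct LoggerProvider {
  void SetProcessor(std::shared_ptr<SimpleProcessor> processor) {
    processor_ = std::move(processor);
  }
  std::shared_ptr<Logger> GetLogger(const std::string & /*name*/) {
    return std::make_shared<Logger>(processor_);
  }
  std::shared_ptr<SimpleProcessor> processor_;
};
```

A user would then create the exporter, wrap it in a `SimpleProcessor`, attach it with `SetProcessor`, fetch a logger with `GetLogger("example_logger")`, and call `Log("Hello, World!")`. The real SDK additionally injects the timestamp, trace identification, span identification, and trace flags before export.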
When run, this example produces a log record with sample data injected for the timestamp, trace identification, span identification, and trace flags fields.
Experiences and takeaways
Throughout this project, we’ve learned a lot about designing large-scale software — from scoping its initial requirements, to writing preliminary design documents, to coding the first prototype, and finally testing the individual units and integration of it all. We learned that designing for reliability, scalability, and distribution is a far more rigorous task than we had done before and is a truly collaborative and rewarding endeavor. We also learned much more about the open source world and its best practices.
Undertaking such a large-scale project taught us how to seek feedback proactively, demonstrate project ownership, and be passionate and actively involved in a complex and evolving project. The experience involved a lot of collaboration with other engineers to seek suggestions, ask questions, and improve and iterate on the design. Overall, we really appreciated the mentorship, collaboration, and guidance of our mentors, Richard Anton, Reiley Yang, and Tom Tan, our manager, Alolita Sharma, our fellow AWS engineers, and the OpenTelemetry engineering community with whom we had the opportunity to work. We encourage new open source developers to try contributing to open source to learn and build, and to challenge themselves by getting involved in a large, collaborative project such as OpenTelemetry.
Future enhancements that we were not able to cover in our creation of the logging prototype include:
- Offering more flexibility in the logging API by introducing more functions, macros, and ways to log. (#441)
- Supporting multiple processors to allow multiple export destinations at once. (#417)
- Being able to use the Logging SDK with other well-known logging libraries, such as Log4cxx. (#420)
We have filed these as issues in the current C++ repository for other contributors to work on.
With this initial implementation of the full logging pipeline, user applications instrumented with the C++ Logging API will now be able to produce and collect logs from their application. More importantly, language repositories in the future will be able to use this logging prototype as a basis for their specific language implementations. The prototype will also help the OpenTelemetry project as a whole to continue active development of the OpenTelemetry logging specification.
About the Authors
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.