In this blog post, AWS interns Wilbert Guo and Kelvin Lo share their experience in enhancing the OpenTelemetry Go SDK to support sending traces to AWS X-Ray. These enhancements are also available in the AWS Distro for OpenTelemetry.

AWS X-Ray is a service that collects data and provides tools that allow us to view, filter, and gain insights into that data in order to identify issues and areas that could be optimized. The data that it collects includes trace data that traces user requests as they travel through the entire application. It then aggregates this trace data so that we can see more detailed information not only about the request and response, but also about calls that our application makes to downstream. These downstream calls can include AWS resources, microservices, databases, and HTTP web APIs.

Supporting AWS X-Ray with OpenTelemetry Go SDK

If you are working in the observability space, you may already know all about the OpenTelemetry project. OpenTelemetry is a fast-developing open source project under the Cloud Native Computing Foundation (CNCF) that provides a set of APIs and SDKs for robust and portable telemetry of cloud-native software. The project enables customers to send telemetry data to multiple different backends for further processing while only having to instrument their application once. The different types of telemetry data that can be sent include traces, metrics, and logs. In this blog post, we will talk about sending traces to the AWS X-Ray backend.

The Go SDK for OpenTelemetry allows engineers to gather data from Go-specific applications. Once the data is collected, it is then sent to backends, such as Prometheus, Jaeger, and AWS X-Ray, to provide further insight into application failures and application performance metrics. What makes AWS X-Ray useful is that it allows engineers to observe their Go applications that are running on AWS-specific services, such as Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS).

To support sending traces from Go applications instrumented with OpenTelemetry to AWS X-Ray, we added four components to the OpenTelemetry Go SDK: the trace ID generator, propagator, ECS resource detector, and the EKS resource detector.

The diagram shows microservices and applications that are instrumented with OpenTelemetry sending traces to AWS X-Ray. The trace data is then aggregated and filtered in order to be analyzed by application developers.

The preceding diagram shows microservices and applications that are instrumented with OpenTelemetry sending traces to X-Ray. The trace data is then aggregated and filtered in order to be analyzed by application developers.

Building the ID generator

The first component we built was the trace ID generator. To build an X-Ray trace ID generator, it is helpful to first understand what a trace ID is. A trace ID is used to uniquely identify a trace in distributed tracing. This is useful because it can be used to group together all spans for a specific trace across all processes. The trace ID is a 32-hex-character lowercase string that is generated when a root span is created. For spans that have a parent, the trace ID is the same as the parent span’s trace ID because they both belong to the same trace.

In OpenTelemetry, the creation of OTLP trace ID uses the W3C trace format, which generates a random unique 32-hex-character lowercase string. However, to use OpenTelemetry tracing with X-Ray, we needed to override the OTLP trace ID creation function. This is because X-Ray does not use the W3C trace format; rather, it uses a different format in which the first 8-hex-digits represents the timestamp at which the trace is generated and the remaining 24-hex-digits are randomly generated.

Example of an X-Ray trace ID:

Example of an AWS X-Ray trace ID

In the diagram below, we show how the X-Ray ID Generator is used with client applications that are instrumented with OpenTelemetry. For client applications that use the OpenTelemetry Go SDK, a TraceProvider object is created in order to start collecting and sending trace data to other backends (e.g., Prometheus, Jaeger, etc.). Inside the TraceProvider object, we can specify which implementation of the IDGenerator to use.

Inside the TraceProvider object, we can specify which implementation of the IDGenerator to use.

We implemented the X-Ray ID Generator by making a single struct with two functions. The first function is NewSpanID(), which was responsible for generating a new spanID. For this function, we used the same implementation that OpenTelemetry uses. The second function that we had inside the ID Generator struct was the NewTraceID(). For this function, we used our own implementation for generating a new traceID. This is because X-Ray accepts a different traceID format than the standard W3C trace format.

AWS X-Ray accepts a different traceID format than the standard W3C trace format. Diagram shows NewSpanID() and NewTraceID()

In the following diagram, we summarize the steps that the NewTraceId() function takes to generate an X-Ray traceId. To get the first 8 digits, we convert the current human date timestamp to an epoch timestamp, which is then converted to hexadecimal. The remaining 24 digits belong to a randomly generated hexadecimal number. Together, we concatenate the first 8 digits and the remaining 24 digits to make up the 32-digit hexadecimal X-Ray trace ID.

Diagram shows the steps that the NewTraceId() function takes to generate an AWS X-Ray traceId. To get the first 8 digits, we convert the current human date timestamp to an epoch timestamp, which is then converted to hexadecimal. The remaining 24 digits belong to a randomly generated hexadecimal number. Together, we concatenate the first 8 digits and the remaining 24 digits to make up the 32 digit hexadecimal AWS X-Ray trace ID.

The source code for the ID generator component follows:

package xrayidgenerator import ( crand "crypto/rand" "encoding/binary" "encoding/hex" "math/rand" "strconv" "sync" "time" "go.opentelemetry.io/otel/trace"
) type IDGenerator struct { sync.Mutex randSource *rand.Rand
} // NewIDGenerator returns an idGenerator used for sending traces to AWS X-Ray
func NewIDGenerator() *IDGenerator { gen := &IDGenerator{} var rngSeed int64 _ = binary.Read(crand.Reader, binary.LittleEndian, &rngSeed) gen.randSource = rand.New(rand.NewSource(rngSeed)) return gen
} // NewSpanID returns a non-zero span ID from a randomly-chosen sequence.
func (gen *IDGenerator) NewSpanID() trace.SpanID { gen.Lock() defer gen.Unlock() sid := trace.SpanID{} gen.randSource.Read(sid[:]) return sid
} // NewTraceID returns a non-zero trace ID based on AWS X-Ray TraceID format.
// (https://docs.aws.amazon.com/xray/latest/devguide/xray-api-sendingdata.html#xray-api-traceids)
func (gen *IDGenerator) NewTraceID() trace.TraceID { gen.Lock() defer gen.Unlock() tid := trace.TraceID{} currentTime := getCurrentTimeHex() copy(tid[:4], currentTime) gen.randSource.Read(tid[4:]) return tid
} func getCurrentTimeHex() []uint8 { currentTime := time.Now().Unix() // Ignore error since no expected error should result from this operation // Odd-length strings and non-hex digits are the only 2 error conditions for hex.DecodeString() // strconv.FromatInt() do not produce odd-length strings or non-hex digits currentTimeHex, _ := hex.DecodeString(strconv.FormatInt(currentTime, 16)) return currentTimeHex
}

Propagator

The second component we built was the propagator, which is used in context propagation. What is context propagation? Simply put, propagation is the ability to correlate events across different services, and also is one of the core principles behind distributed tracing. In a distributed system, components need to be able to collect, store, and transfer metadata (also referred to as a context). For that reason, propagator objects are configured inside tracer objects in order to support transferring of context across difference processes. A context often will have information identifying the current span and trace, and can contain arbitrary correlations as key-value pairs. Propagation is when context is bundled and transferred across services, often via HTTP headers. This is done by first injecting context into a request and then this is extracted by a receiving service, which can then make additional requests, and inject context to be sent to other services, and so on. Together, context and propagation represent the engine behind distributed tracing.

As a result, the purpose of the propagator we built was to be able to support sending traces specifically from OpenTelemetry to X-Ray. We accomplished this by creating an X-Ray-specific propagator that translates the OpenTelemetry SpanContext into the equivalent X-Ray header format.

The X-Ray header format consists of three parts:

  • the root trace ID
  • the parent ID
  • the sampling decision

The following is an example of X-Ray trace header with root trace ID, parent segment ID, and sampling decision:

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793; Parent=53995c3f42cd8ad8; Sampled=1

Note:

  • The root is required, whereas the parent ID and sampling decision are optional.
  • Sampling decision refers to the algorithm used by X-Ray to ensure efficient tracing and provides a representative sample of the requests that your application served to determine which requests gets traced. By default, X-Ray records the first request each second, and five percent of any additional requests.
  • For more information, see Configuring sampling rules in AWS X-Ray.

When building the X-Ray propagator, we implemented three functions:

  • inject()
  • extract()
  • fields()

We implemented these functions inside a struct shown in the following diagram.

We implemented these functions inside a struct shown in the diagram: Propagator with inject(), extract(), and fields()

The first function, inject(), injects the X-Ray values into the header. Our implementation accepted two parameters, which are the context format for propagating spans and the textMapCarrier interface to allow our propagator to be implemented.

Next, the extract() function we implemented was required to extract the values from an incoming request. An example would be extracting the values from the headers of an HTTP request. Given a context and a carrier, the extract() function extracts the context values from a carrier and returns a new context. This new context is created from the old context along with the extracted values. For our Go SDK implementation, the extract() function accepted two parameters: the context and the textMapCarrier interface.

The final function in our propagator was the fields() function. This function refers to the predefined propagation fields. If the carrier is reused, the fields should be deleted before calling inject(). For example, if the carrier is a single-use or immutable request object, we don’t need to clear fields because they couldn’t have been set before. If it is a mutable returnable object, successive calls should clear these fields first. This will return a list of fields that will be used by the TextMapPropagator.

The source code for the propagator component is shown below:

package xray import ( "context" "errors" "strings" "go.opentelemetry.io/otel/propagation" "go.opentelemetry.io/otel/trace"
) const ( traceHeaderKey = "X-Amzn-Trace-Id" traceHeaderDelimiter = ";" kvDelimiter = "=" traceIDKey = "Root" sampleFlagKey = "Sampled" parentIDKey = "Parent" traceIDVersion = "1" traceIDDelimiter = "-" isSampled = "1" notSampled = "0" traceFlagNone = 0x0 traceFlagSampled = 0x1 << 0 traceIDLength = 35 traceIDDelimitterIndex1 = 1 traceIDDelimitterIndex2 = 10 traceIDFirstPartLength = 8 sampledFlagLength = 1
) var ( empty = trace.SpanContext{} errInvalidTraceHeader = errors.New("invalid X-Amzn-Trace-Id header value, should contain 3 different part separated by ;") errMalformedTraceID = errors.New("cannot decode trace ID from header") errLengthTraceIDHeader = errors.New("incorrect length of X-Ray trace ID found, 35 character length expected") errInvalidTraceIDVersion = errors.New("invalid X-Ray trace ID header found, does not have valid trace ID version") errInvalidSpanIDLength = errors.New("invalid span ID length, must be 16")
) // Propagator serializes Span Context to/from AWS X-Ray headers.
// Example AWS X-Ray format:
// X-Amzn-Trace-Id: Root={traceId};Parent={parentId};Sampled={samplingFlag}
type Propagator struct{} // Asserts that the propagator implements the otel.TextMapPropagator interface at compile time.
var _ propagation.TextMapPropagator = &Propagator{} // Inject injects a context to the carrier following AWS X-Ray format.
func (xray Propagator) Inject(ctx context.Context, carrier propagation.TextMapCarrier) { sc := trace.SpanFromContext(ctx).SpanContext() if !sc.TraceID.IsValid() || !sc.SpanID.IsValid() { return } otTraceID := sc.TraceID.String() xrayTraceID := traceIDVersion + traceIDDelimiter + otTraceID[0:traceIDFirstPartLength] + traceIDDelimiter + otTraceID[traceIDFirstPartLength:] parentID := sc.SpanID samplingFlag := notSampled if sc.TraceFlags == traceFlagSampled { samplingFlag = isSampled } headers := []string{traceIDKey, kvDelimiter, xrayTraceID, traceHeaderDelimiter, parentIDKey, kvDelimiter, parentID.String(), traceHeaderDelimiter, sampleFlagKey, kvDelimiter, samplingFlag} carrier.Set(traceHeaderKey, strings.Join(headers, ""))
} // Extract gets a context from the carrier if it contains AWS X-Ray headers.
func (xray Propagator) Extract(ctx context.Context, carrier propagation.TextMapCarrier) context.Context { // extract tracing information if header := carrier.Get(traceHeaderKey); header != "" { sc, err := extract(header) if err == nil && sc.IsValid() { return trace.ContextWithRemoteSpanContext(ctx, sc) } } return ctx
} // extract extracts Span Context from context.
func extract(headerVal string) (trace.SpanContext, error) { var ( sc = trace.SpanContext{} err error delimiterIndex int part string ) pos := 0 for pos < len(headerVal) { delimiterIndex = indexOf(headerVal, traceHeaderDelimiter, pos) if delimiterIndex >= 0 { part = headerVal[pos:delimiterIndex] pos = delimiterIndex + 1 } else { //last part part = strings.TrimSpace(headerVal[pos:]) pos = len(headerVal) } equalsIndex := strings.Index(part, kvDelimiter) if equalsIndex < 0 { return empty, errInvalidTraceHeader } value := part[equalsIndex+1:] if strings.HasPrefix(part, traceIDKey) { sc.TraceID, err = parseTraceID(value) if err != nil { return empty, err } } else if strings.HasPrefix(part, parentIDKey) { //extract parentId sc.SpanID, err = trace.SpanIDFromHex(value) if err != nil { return empty, errInvalidSpanIDLength } } else if strings.HasPrefix(part, sampleFlagKey) { //extract traceflag sc.TraceFlags = parseTraceFlag(value) } } return sc, nil } // indexOf returns position of the first occurrence of a substr in str starting at pos index. func indexOf(str string, substr string, pos int) int { index := strings.Index(str[pos:], substr) if index > -1 { index += pos } return index
} // parseTraceID returns trace ID if valid else return invalid trace ID.
func parseTraceID(xrayTraceID string) (trace.TraceID, error) { if len(xrayTraceID) != traceIDLength { return empty.TraceID, errLengthTraceIDHeader } if !strings.HasPrefix(xrayTraceID, traceIDVersion) { return empty.TraceID, errInvalidTraceIDVersion } if xrayTraceID[traceIDDelimitterIndex1:traceIDDelimitterIndex1+1] != traceIDDelimiter || xrayTraceID[traceIDDelimitterIndex2:traceIDDelimitterIndex2+1] != traceIDDelimiter { return empty.TraceID, errMalformedTraceID } epochPart := xrayTraceID[traceIDDelimitterIndex1+1 : traceIDDelimitterIndex2] uniquePart := xrayTraceID[traceIDDelimitterIndex2+1 : traceIDLength] result := epochPart + uniquePart return trace.TraceIDFromHex(result)
} // parseTraceFlag returns a parsed trace flag.
func parseTraceFlag(xraySampledFlag string) byte { if len(xraySampledFlag) == sampledFlagLength && xraySampledFlag != isSampled { return traceFlagNone } return trace.FlagsSampled
} // Fields returns list of fields used by HTTPTextFormat.
func (xray Propagator) Fields() []string { return []string{traceHeaderKey}
}

Amazon ECS resource detector

The third component we implemented was the Amazon ECS resource detector. Amazon ECS is a fully managed container orchestration service that makes it easy to run, stop, and manage containers on a cluster. A container defined in a task definition can be used to run individual tasks or tasks within a service. A service is a configuration that enables running and maintaining a specified number of tasks simultaneously in a cluster. Tasks and services can be run on a serverless infrastructure that is managed by AWS Fargate. The ECS resource detector is responsible for detecting whether or not an application or service that’s producing telemetry data is running on ECS.

diagram showing workflow explained in previous paragraph

The ECS resource detector determines whether the application generating telemetry data is running on Amazon ECS. It will then populate ECS-specific attributes in the resource object, such as the containerID and hostName. This comes in handy when trying to troubleshooting a failed request by pinpointing exactly which container was the root case.

Diagram shows: The ECS resource detector determines whether or not the application generating telemetry data is running on ECS. It will then populate ECS specific attributes in the resource object such as the containerID and hostName. This comes in handy when trying to troubleshooting a failed request by pinpointing exactly which container was the root case.

For our high-level design of the ECS resource detector, we had a struct containing three functions to help detect whether the application was running inside an Amazon ECS environment.

For our high-level design of the ECS resource detector, we had a struct containing three functions to help detect whether o not the application was running inside an ECS environment

In the following code block, we show an example usage of the Amazon ECS resource detector:

import ( "context" "go.opentelemetry.io/contrib/detectors/ecs"
) func main() { // Instantiate a new ECS resource detector detectorUtils := new(ecsDetectorUtils) ecsResourceDetector := ResourceDetector{detectorUtils} // If on ECS, resourceObj contains the containerID and hostName // If NOT on ECS, resourceObj is empty resourceObj, err := ecsResourceDetector.Detect(context.Background()) }

The full source code for the ECS detector can be found on GitHub.

Amazon EKS resource detector

The last component we implemented was the EKS resource detector. Amazon EKS is a fully managed Kubernetes service that allows you to run and deploy Kubernetes on AWS. Kubernetes is open source and allows organizations to deploy and manage containerized applications, such as platforms as a service (PaaS), batch processing workers, and microservices in the cloud at scale. Through Amazon EKS, organizations using AWS can get the full functions of Kubernetes without having to install or manage Kubernetes itself.

alolitas go support f10

What the EKS resource detector will do is detect whether or not the application that’s generating telemetry data is running on EKS and then populate EKS-specific attributes inside the resource object. These attributes include the containerID and clusterName. This comes in handy when trying to troubleshooting a failed request by pinpointing exactly which container was the root case.

alolitas go support f11

Note: For more details about detecting information for Elastic Kubernetes Service plugin, refer to the AWS X-Ray developer guide (PDF).

For our high-level design for the EKS resource detector, we had a struct containing member variables related to Kubernetes and authentication along with functions to help detect and create the resource object for supporting EKS detection with OpenTelemetry.

alolitas go support f12

The following diagram shows the flow of function executions for the EKS resource detector. In our implementation, we first invoke the detect() function, which checks whether the application is running on EKS. This is done by calling the isEks() function, which uses helper functions, including isK8s() and getK8sCredHeader(). Once the helper functions return, if isEks() is true—meaning the application is running on EKS—then it will call getClusterName() and getContainerId() to retrieve the attributes for the resource object. If isEks() returns false—meaning the application is not running EKS—then an empty resource object will be returned.

alolitas go support f13

In the following code block, we show an example of using the EKS resource detector to extract EKS specific attributes:

import ( "context" "go.opentelemetry.io/contrib/detectors/eks"
) func main() { // Instantiate a new EKS Resource detector detectorUtils := new(ecsDetectorUtils) eksResourceDetector := ResourceDetector{detectorUtils} // If on EKS, resourceObj contains the containerID and clusterName // If NOT on EKS, resourceObj is empty resourceObj, err := eksResourceDetector.Detect(context.Background()) }

The full source code for the EKS detector can be found on GitHub.

Testing strategy

The testing process that we used was test driven development (TDD). This is because TDD ensures that there are fewer bugs and has better code coverage. In TDD, code is only written after a test case is written, which ensures that the code being written is covered by a unit test. The process of TDD is shown below.

alolitas go support f14

Testing tools we used

We used two different tools for testing: the Go testing package and the Go testify library. We chose to use the testing package because this came by default with Go, which made writing and executing unit tests easy. We chose to use the testify library because this library added extra functionality that we could utilize in our unit tests, such as performing assert statements and mocking functions. In the next sections we will provide a brief overview of each of these tools.

Go testing package

This package is the standard unit test package for Go. It provides support for automated testing of Go packages and uses the go test command to run the tests. To use the package, we include it as an import in our test file (import "testing").

To write a new test suite, we followed these steps:

  1. Create a new file whose name ends with _test.go.
  2. Put the test file in the same package as the code.
  3. Write unit test functions inside our newly create test file.
  4. Execute the unit tests by using the following commands:
// Run only one specific unit test
$ go test -run // Run entire unit test file
$ go test 

To write a unit test, we used the following format:

func TestXxx(*testing.T){ // Unit test code goes here
}
  • The function name begins with the word Test.
  • The rest of the function name describes the unit test and uses camel-case starting with an uppercase letter.
  • The function must take an argument of *testing.T.

To learn more about using the Go testing package, visit the official package docs.

Go testify library

The testify library can be used in conjunction with the testing package in order to perform assertions and mocking.

An assertion is a Boolean expression at a specific point in a program, which will be true unless there is a bug in the program. A test assertion is defined as an expression, which encapsulates some testable logic specified about a target under test.

Benefits of assertions:

  • Detect subtle errors that might go unnoticed.
  • Detect errors sooner after they occur.
  • Make a statement about the effects of the code that is guaranteed to be true.

Here, in the following code, we show an example of using the testify library to perform an assert statement:

func TestSum(*testing.T){ expectedSum := 8 actualSum = computeSum(5, 3) assert.Equal(t, expectedSum, actualSum, "sum is incorrect")
}

Mocking with the testify library

Mock testing is an approach to unit testing that lets us test a specific service or component in isolation to other services and APIs. A typical example of where to use mock testing is if we are trying to unit test a function that makes an HTTP request, we wouldn’t want to make an HTTP request each time the test runs. Rather, we will mock the HTTP request and return a value that we expect. The benefit of this approach is that we are not dependent on testing whether the HTTP request works; rather, we can now test our function with the assumption that the HTTP request is successful.

To mock a function using the testify library, the following steps must be executed:

  1. In the main code file, create an interface that includes the list of functions to mock.
  2. In the main code file, create a struct that will implement the interface created in step 1.
  3. In the main code file, implement the interface by creating the functions defined in the interface.
  4. In the unit test file, create a mock struct.
  5. In the unit test file, create mock functions.
  6. In the unit test file, call the mock functions in the unit tests.

Example:

my_service.go

// step 1 interface
type Database interface { connect() error sendMessage(*string) error
} // step 2 struct
type SQLDatabase struct {} // step 3 implementing interface functions
func (db SQLDatabase) connect() error { // function implementation goes here
} // step 3 implementing interface functions
func (db SQLDatabase) sendMessage(*string) error { // function implementation goes here
}

my_service_test.go

import "github.com/stretchr/testify/mock" // Step 4 create a mock struct
type MockDatabase interface { mock.Mock
} // Step 5 create mock functions
func (mockDb *MockDatabase) connect() error { // mock function implementation goes here
} // Step 5 create mock functions
func (mockDb *MockDatabase) sendMessage(*string) error { // mock function implementation goes here
} // Step 6 use mock functions in unit test
func TestConnect(t *testing.T) { mockDb := new(MockDatabase) // When calling connect(), it will return nil mockDb.On("connect").Return(nil) // Rest of unit test...
} // Step 6 use mock functions in unit test
func TestSendMessage(t *testing.T) { mockDb := new(MockDatabase) // When calling sendMessage("test message"), it will return nil mockDb.On("sendMessage", "test message").Return(nil) // Rest of unit test...
}

Conclusion

While working on this project, we learned a lot about the process of developing and delivering high-quality open source code. Because OpenTelemetry is an open source project, we learned how to work with a large community of contributors, including developers from several organizations. We were able to learn and embrace the open source principles of transparency and discussion. Throughout this project, the key success factors included documenting the requirements, design, and test plan, along with joining SIG meetings and discussions to stay active.

Working with the OpenTelemetry community has been an amazing experience, and we invite you to join us. To learn more, check out the OpenTelemetry membership page. Join the discussions, test new features, and contribute your ideas and experience. We hope to see you there!

References

Wilbert Guo

Wilbert Guo

Wilbert Guo is a senior at the University of Toronto, majoring in computer science with a focus on artificial intelligence and software design. He is also a software engineering intern at AWS and is interested in observability and UI/UX design.

Kelvin Lo

Kelvin Lo

Kelvin Lo is a senior majoring in computer science at the University of British Columbia. He is currently working as a software engineer intern at AWS and is interested in observability and infrastructure.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.