Amazon CloudWatch provides a mechanism to publish metrics through logs using a format called Embedded Metric Format (EMF). You can use this to ingest complex application metric data to CloudWatch along with other log data. Although you can use this feature in all environments, it’s particularly useful in high-cardinality environments such as AWS Lambda functions and container environments such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), or Kubernetes on EC2. CloudWatch automatically extracts custom metrics from the log data, which you can use to visualize and create alarms. In addition, you can query the detailed log events associated with the extracted metrics using CloudWatch Logs Insights to get deeper insights and perform analysis.

At FireEye, we used this functionality to improve efficiency, optimize costs, and effectively serve our customers.

The problem

In early 2020, during one of our technical discussion meetings, we reviewed the architecture we used to collect and publish metrics from our application environment to CloudWatch. We identified an opportunity to optimize the design to improve performance, cost, and efficiency.

The following diagram shows our original architecture, in which the application receives data from many customers, and we track performance using CloudWatch metrics. Here, we call the PutMetricData API to ingest metrics into CloudWatch each time the Lambda function is invoked.

Existing architecture for metric ingestion used by FireEye

This powered all our telemetry and provided data for our dashboards, reporting, CloudWatch alarms, and more for the application. We love this approach because it simplifies so many aspects of operating an application. The application receives data via Lambda functions, invoked from Amazon API Gateway, and via cross-account Amazon S3 notifications to process new files created in customer S3 buckets.

Our application environment is busy, with thousands of requests and events invoking Lambda functions every minute. During each invocation, we called the CloudWatch PutMetricData API to record how much data the function had processed. This created two problems for us:

  • The runtime cost of the Lambda function increased due to the additional execution time of calling the PutMetricData API.
  • Although the PutMetricData API call isn’t expensive ($.01 per 1,000 requests), we could reduce costs further, given the scale at which we were using Lambda and the API calls.
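To make the original pattern concrete, here is a hedged sketch of what a per-invocation PutMetricData call can look like. The namespace, dimension names, and helper function are illustrative, not FireEye's actual code; sending the data would additionally require boto3 and AWS credentials.

```python
def build_metric_data(customer_name, customer_id, count):
    """Build the arguments a PutMetricData call would receive for one invocation.

    Namespace and dimension names here are hypothetical examples.
    """
    return {
        "Namespace": "OurCustomStats",
        "MetricData": [
            {
                "MetricName": "Events",
                "Dimensions": [
                    {"Name": "Customer", "Value": customer_name},
                    {"Name": "Identifier", "Value": customer_id},
                ],
                "Unit": "Count",
                "Value": float(count),
            }
        ],
    }

# Inside the Lambda handler, this would be sent on every invocation,
# adding an API round trip to each run:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.put_metric_data(**build_metric_data(name, cid, n))
```

Because the API call happens inside the handler, its latency is billed as Lambda execution time on every single invocation, which is exactly the overhead described above.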

While assessing potential options to improve the architecture, we analyzed several solutions that came with different cost-benefit trade-offs. At some point, we even considered building our own metrics-gathering toolset, but this would have provided no direct benefit to our customers and distracted from our mission. We determined that we didn’t want to invest significant time and cost in building and operating a custom-built metrics-collection solution ourselves, but preferred to focus our efforts on the features that mattered most to our customers.

The solution

When we learned that AWS had launched Embedded Metric Format (EMF) for CloudWatch, we knew we could use EMF to solve our problems.

We replaced the single line of code in the Lambda function that called the PutMetricData API with a simple print statement. Like magic, we had the same stats in CloudWatch at a 65% cost reduction, dropping the application’s overall cost by more than a third, with just a few minutes of development time. The real surprise was that the operating cost savings turned out to be the least significant win.

The following code is an example metric statement (in Python) used in our Lambda functions:

print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [
            {
                "Namespace": "OurCustomStats",
                "Dimensions": [["Customer", "Identifier"]],
                "Metrics": [{"Name": "Events", "Unit": "Count"}]
            }
        ]
    },
    "Customer": customer_name,
    "Identifier": customer_id,
    "Events": float(count)
}))
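A hedged sketch of wrapping that statement in a reusable helper, which makes the EMF structure easy to test and reuse across functions. The namespace and dimension names mirror the example above; the helper itself is our illustration, not part of the original code.

```python
import json
import time

def emf_event(customer_name, customer_id, count):
    """Return an EMF record as a dict; printing its JSON to stdout is all
    a Lambda function needs for CloudWatch to extract the declared metric."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "OurCustomStats",
                "Dimensions": [["Customer", "Identifier"]],
                "Metrics": [{"Name": "Events", "Unit": "Count"}],
            }],
        },
        "Customer": customer_name,
        "Identifier": customer_id,
        "Events": float(count),
    }

# In the handler: one print, no API call, no extra execution time.
print(json.dumps(emf_event("example-customer", "cust-001", 42)))
```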


The following diagram illustrates our updated architecture after implementing EMF. In this new design we simply write our metrics into CloudWatch Logs in EMF format. CloudWatch automatically extracts metrics out of the log events and makes them available under CloudWatch Metrics.

New architecture used by FireEye using EMF log format


What happened next was transformational for the application and our understanding of performance monitoring.

The benefits

We enjoyed the 65% lower cost for metrics data collection, but we discovered another benefit. Occasionally, we wanted to dive into a detailed investigation of an operational anomaly. We quickly discovered that we could use CloudWatch Logs Insights to make programmatic, fine-grained queries against the JSON written into CloudWatch Logs. Furthermore, we could easily add any of those queries to a CloudWatch dashboard, so its query results appeared alongside the existing CloudWatch custom metrics that were being automatically collected. This let us do things like easily create aggregate metrics that don’t normally fit inside a CloudWatch metrics dashboard.

Better metrics were nice, but what really changed our general approach was realizing that we could use this technique to write out opportunistic metrics: fine-grained, debug-level performance stats ingested as JSON in CloudWatch Logs. If we had a performance question, we could run simple CloudWatch Logs Insights queries against the recorded data. Before using this technique, it was cost-prohibitive to record separate CloudWatch metrics for all these potential values, especially given that we had no way of knowing how many we needed or how often.

The following code is an example CloudWatch Logs Insights query that calculates the percentage of time spent reading a customer Amazon S3 key, based on our recorded performance statistics:

fields @timestamp, @message
| filter @message like '"read_key"'
| stats sum(read_key)/sum(total)*100 as read_percentage by bin(5m)

Shifting to a “log everything” approach for JSON metrics allowed us to add things like low-level timing values to our Lambda function code. This let us dive into the data on a per-customer level, discover significant findings by ingesting high-cardinality telemetry as log data, and control what gets created as a metric.
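A hedged sketch of what such “log everything” instrumentation might look like. The field names (such as read_key_ms) and the namespace are illustrative; the key idea is that only one field is declared as a real metric, while every other timing stays a log-only JSON field that Logs Insights can still query on demand.

```python
import json
import time

def emit_timings(customer_name, timings_ms):
    """Write per-step timings as one EMF record.

    Only the metric declared under "Metrics" becomes a CloudWatch metric;
    the remaining fields are free debug-level data queryable via Logs Insights.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "OurCustomStats",  # illustrative namespace
                "Dimensions": [["Customer"]],
                # Promote just the total to a real metric; the per-step
                # timings remain log-only fields.
                "Metrics": [{"Name": "total_ms", "Unit": "Milliseconds"}],
            }],
        },
        "Customer": customer_name,
        **timings_ms,  # e.g. {"total_ms": ..., "read_key_ms": ...}
    }
    print(json.dumps(record))
    return record
```

Because the extra fields cost nothing beyond log storage, instrumentation like this can be added speculatively and queried only when a performance question actually comes up.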

For example, based on low-level performance timing values, we discovered that a single customer’s data transfer was using more than half the total Lambda function time in one Region. We realized we could save about half the cost of operating the application in that Region by creating a bucket replication mechanism, so that the function wasn’t idling while the objects were being transferred.

This also gave us the performance insight that allowing customers to use buckets in Regions that require cross-Region data transfers might be an unsupportable business function, and that onboarding instructions and documentation could save us from this costly setup in the future. Without the detailed performance metrics available, we would have erroneously assumed that the costs were inherent to the algorithms in the application.

This discovery showed that most of our function code spent its time waiting on external operations, which suggested that spending time improving algorithmic code performance was far less valuable than architectural improvements, such as additional caching.

We discovered several ways to publish log and metric data to CloudWatch, each of which can be beneficial depending on the use case. For FireEye, we have a high-cardinality environment and want to send log and metric data quickly, with negligible performance impact to the application.

If you write JSON to CloudWatch Logs, you can query it at any time with Insights to do reporting, and you only pay for the query, not the cost of the metrics. If you want to use them as metrics later, complete with alerting, you can do so with metric filters. You can start with JSON logging and graduate mission-critical queries to full CloudWatch Metrics. You can also simply use the PutMetricData API to send metric data directly to CloudWatch metrics, if your use case demands it.
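A hedged sketch of that “graduation” step: promoting a JSON log field to a full CloudWatch metric with a metric filter. The log group, namespace, and helper are hypothetical examples; applying the filter would additionally require boto3 and AWS credentials, e.g. boto3.client("logs").put_metric_filter(**build_metric_filter(...)).

```python
def build_metric_filter(log_group, field, namespace):
    """Build the arguments a CloudWatch Logs PutMetricFilter call would
    receive to extract one JSON field as a metric.

    All names here are illustrative placeholders.
    """
    return {
        "logGroupName": log_group,
        "filterName": f"{field}-extractor",
        # Match log events where the JSON field exists, whatever its value.
        "filterPattern": f"{{ $.{field} = * }}",
        "metricTransformations": [{
            "metricName": field,
            "metricNamespace": namespace,
            # Publish the field's own value as the metric value.
            "metricValue": f"$.{field}",
        }],
    }
```

The appeal of this path is that nothing in the application changes: the function keeps printing the same JSON, and the filter turns an existing field into an alertable metric only once it has proven mission-critical.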

Conclusion

Armed with these performance stats on dashboards, our Site Reliability Engineering (SRE) team could stay on top of tuning values, while also using the raw data for debugging and developer-led performance evaluations. This ensured that our staff was always directly serving our customers, by focusing on accelerated service delivery instead of infrastructure – fulfilling the promise of the cloud.

About the author

Martin Holste

As CTO for Cloud at FireEye, Martin Holste is responsible for shaping cloud security offerings, developing the corporate cloud security strategy, and passionately working with customers to help secure their cloud workloads. Prior to serving as CTO for Cloud, Martin led the team that built the FireEye Advanced URL Detection Engine in the cloud and founded the cloud-native Helix Platform at FireEye.

Martin is a regular contributor to FireEye blogs, webinars, and ebooks, and has appeared on panels, podcasts, and given presentations covering serverless Big Data solutions, cloud security, and security investigation cognitive strategies at AWS re:Invent, RSA Conference, Blackhat, and other security conferences.

The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.