In this guest post, Ciaran Kearney, Data Engineer at multinational telecommunications company BT, discusses how BT built a monitoring solution using Amazon CloudWatch dashboards, composite alarms, and embedded metric format to support the monitoring of millions of devices.
Customers with high-cardinality monitoring use cases often face both cost and analytic challenges when implementing observability. When you combine Amazon CloudWatch embedded metric format (EMF) with composite alarms, you can monitor high-cardinality datasets effectively. Together, they provide:
- A high-level overview of your system.
- Insights into overall performance.
- The ability to quickly deep dive into unique device telemetry.
BT has millions of Smart Hub 2 devices that provide customers across the United Kingdom with broadband, Wi-Fi, and DECT phone connectivity. To help us provide a quality service and troubleshoot problems, these devices report call quality information back to AWS. We worked with AWS to build a system to monitor all of our devices in a single dashboard. This makes it possible for us to look for outliers for key metrics, such as latency, jitter, packet loss, and failed calls, while triaging issues to support individual customers.
We wanted to aggregate data into different dimensions to help us better understand the events generated. This allows us to perform complex analytics at each layer and closely manage costs. We also aggregate on call type (emergency, mobile, international, or local), on direction (inbound or outbound), and on termination cause, as per TR-069 specifications[i]. This aggregation allows BT to maintain telemetry granularity on a per-device basis in addition to aggregation points.
In the words of Ciaran Kearney, Data Engineer at BT, “The goal was to have a solution that was cost-effective and could handle the scale of monitoring we needed.”
We considered two approaches for monitoring our devices:
- Metrics – In this approach, we would ingest the telemetry as a metric. Due to its high cardinality, this approach would create a large cost, regardless of metric solution.
- Logs – In this approach, we considered ingesting the telemetry as a log. Although it would be more cost-efficient, performance would be slower because aggregations are typically computed at query execution time.
Working with AWS, we created a solution that uses both metrics and logs, which results in cost efficiency and good performance. We use logs to store per-device telemetry. We use metrics to create three levels of aggregation (country, geo, locale). We accomplished this through the use of EMF and composite alarms.
Let’s take a closer look at how we implemented this solution.
We needed to ingest the call quality information at a per-device granularity to assist our customers when troubleshooting issues. In addition, we wanted to aggregate the data to alert us when multiple customers might be impacted by an issue on our network. To do this, we chose CloudWatch EMF. For more information about this feature, see the Enhancing workload observability using Amazon CloudWatch Embedded Metric Format blog post.
We were able to ingest logs from all our devices, and record specific events with the relevant metrics, dimensions, and metadata (for example, latency or jitter). Logs and metrics are both important. They serve their own unique purpose. We use metrics to provide data at an aggregated level to provide a view of the overall service. We use device-specific logs to enable per-customer troubleshooting and support when needed.
CloudWatch EMF saved us manual effort because it extracts the custom metrics for us. We just needed to create the visualizations and alarms based on our key call quality measures. This reduced the development overhead and time required to bring the service live. There was no need to implement any custom PutMetricData API calls because CloudWatch extracts the metric data from the logs, which helps us save money on the solution.
The following is an example of our EMF configuration in which we store callSuccess and callFailure events as CloudWatch metrics, aggregated by region, such as London or Northern Ireland (e.g. [“region”]), and across all of the UK. The individual telemetry for each device is stored as log data in CloudWatch Logs.
Figure 2 shows the EMF configuration. For brevity, we removed a large number of fields that were not turned into CloudWatch metrics and remained log data.
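As a rough illustration of the shape such an EMF document can take, here is a minimal Python sketch. It is not BT's actual configuration: the namespace, the `deviceId` and `jitterMs` fields, and the helper names are hypothetical, and the empty dimension set used for the all-UK rollup is an assumption based on the EMF specification rather than something shown in the post.

```python
import json
import time


def build_emf_event(device_id, region, call_success, call_failure,
                    jitter_ms=None, timestamp_ms=None):
    """Build one EMF log event as a dict.

    All field names except callSuccess, callFailure and region are
    illustrative assumptions, not BT's real schema.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF expects epoch milliseconds
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": "Example/CallQuality",  # hypothetical namespace
                    # Two dimension sets: one metric series per region, and
                    # one with no dimensions (assumed all-UK rollup).
                    "Dimensions": [["region"], []],
                    "Metrics": [
                        {"Name": "callSuccess", "Unit": "Count"},
                        {"Name": "callFailure", "Unit": "Count"},
                    ],
                }
            ],
        },
        # Dimension value and metric values must be top-level members.
        "region": region,
        "callSuccess": call_success,
        "callFailure": call_failure,
        # Not listed under Dimensions or Metrics, so these stay log-only
        # properties and do not add metric cardinality.
        "deviceId": device_id,
        "jitterMs": jitter_ms,
    }


def to_log_line(event):
    """Serialize the event as a single JSON line for CloudWatch Logs."""
    return json.dumps(event, separators=(",", ":"))


if __name__ == "__main__":
    print(to_log_line(build_emf_event("device-001", "London", 1, 0, jitter_ms=12.5)))
```

Keeping the device identifier out of the dimension sets is the point of the design: the per-device detail remains searchable in the log event, while only the aggregated series become metrics.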
By ingesting the data as a log, and then converting it to a metric as an aggregated data point, we were then able to use CloudWatch Logs Insights to dive deep into specific device metrics or log data. Each Smart Hub 2 device sends a lot of additional call quality data, which wouldn’t make sense as a metric dimension, but remains available in the log data as a property in the EMF specification.
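To make the per-device deep dive concrete, the following sketch builds a CloudWatch Logs Insights query string for a single device. The `deviceId` property name, the default field list, and the `device_query` helper are assumptions for illustration; the actual property names depend on what each device writes into its log events.

```python
import re

# Restrict identifiers to a conservative character set so that a device ID
# cannot break out of the quoted string in the query (illustrative rule).
_SAFE_ID = re.compile(r"^[A-Za-z0-9_.:-]+$")


def device_query(device_id,
                 fields=("@timestamp", "region", "callSuccess", "callFailure"),
                 limit=50):
    """Return a Logs Insights query for one device's most recent events."""
    if not _SAFE_ID.match(device_id):
        raise ValueError(f"unexpected characters in device id: {device_id!r}")
    field_list = ", ".join(fields)
    return (
        f"fields {field_list}\n"
        f'| filter deviceId = "{device_id}"\n'
        "| sort @timestamp desc\n"
        f"| limit {int(limit)}"
    )


if __name__ == "__main__":
    print(device_query("device-001", limit=10))
```

The resulting string can be pasted into the Logs Insights console, or passed as the `queryString` argument when starting a query through the CloudWatch Logs API.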
We also used CloudWatch composite alarms to map alarms to different levels of aggregation with CloudWatch Logs Insights. For more information, see the Improve monitoring efficiency using Amazon CloudWatch composite alarms blog post. Composite alarms mirror the aggregation that we created using EMF. We didn’t have to build new aggregations for our alarms, which saved us time and removed complexity. We then took this solution a step further by having composite alarms of composite alarms, matched up to the aggregation levels that we had defined. As an example, if an alarm triggers at the lowest level of aggregation, the alarm flows up to the highest level of aggregation as a warning. If multiple alarms are triggered, the alarm flows up to the highest level of aggregation as critical.
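The warning and critical behavior described above can be sketched as composite alarm rule expressions. This is one possible reading, not BT's actual rules: the alarm names are hypothetical, "multiple alarms" is assumed to mean at least two children in ALARM at once, and the helper functions are illustrative.

```python
from itertools import combinations


def alarm_ref(name):
    """Rule term that is true while the named child alarm is in ALARM."""
    return f'ALARM("{name}")'


def warning_rule(children):
    """True when any single child alarm is in ALARM."""
    return " OR ".join(alarm_ref(c) for c in children)


def critical_rule(children):
    """True when at least two child alarms are in ALARM at the same time.

    Expressed as an OR over every pair of children. With fewer than two
    children the condition can never hold, so the constant FALSE is returned.
    """
    pairs = [f"({alarm_ref(a)} AND {alarm_ref(b)})"
             for a, b in combinations(children, 2)]
    return " OR ".join(pairs) if pairs else "FALSE"


if __name__ == "__main__":
    # Hypothetical lowest-level alarm names for one aggregation tier.
    locales = ["calls-failed-belfast", "calls-failed-derry", "calls-failed-newry"]
    print("warning: ", warning_rule(locales))
    print("critical:", critical_rule(locales))
```

Each expression would be supplied as the rule of a composite alarm, and because a composite alarm can itself be referenced in another rule, the same two helpers can be reused at each level of the hierarchy. The pairwise expansion grows quadratically with the number of children, so for wide tiers it may be necessary to group children into intermediate composite alarms to keep each rule expression short.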
In this post, I’ve shown how CloudWatch provided BT with a cost-effective and powerful monitoring solution that allows us to monitor millions of Smart Hub 2 devices across the UK. Using EMF and composite alarms, we can ingest per-device log data and create a hierarchy of aggregated metrics. We then represent these metrics in a corresponding hierarchy of alarms.
As Ciaran Kearney says, “Within the first two weeks of operation, the CloudWatch solution alerted our operations teams to a major outage within moments of it occurring. Our existing systems would have taken much longer. This now allows us to react faster and put things right sooner.”
This monitoring solution meets our needs and requirements. We’ll be deploying similar solutions across more of our products so that we can provide a better service to our more than 30 million customers. By taking this scale into consideration when we designed this monitoring solution, and the additional cardinality it will bring, we have a design that delivers today and is future-proofed for scaling to many more millions of devices.
[i] TR-069 is a specification that defines application layer protocols across customer premises equipment.
About the authors
Ciaran Kearney is a Data Engineer at multinational telecommunications company BT, based in Belfast.
Andrew Robison is a Principal Solutions Architect at Amazon Web Services, based in the UK. As part of the Well-Architected team, Andrew is the Geo lead for Well-Architected across EMEA. He works with AWS Partner Network partners and customers of all sizes to help them build secure, high-performing, resilient, and efficient infrastructure for their applications.
Greg Eppel is a Principal Solutions Architect for Observability with AWS and lives in the Houston metro area. He has been assisting AWS customers since 2016 and has been using AWS since 2008. He has over 15 years of developer and operations experience, primarily with Microsoft technology, and prior to joining AWS was a CTO for a SaaS company in Vancouver, BC.