Debugging performance issues is always a challenge, especially in production. In the normal software development workflow, as we start writing and reviewing our code, we may introduce bugs or inefficiencies that lead to decreased overall performance of our application after deployment. So, we created Amazon CodeGuru, a developer tool powered by machine learning (ML) that identifies an application’s most expensive lines of code and provides intelligent recommendations for improving code quality.

Amazon CodeGuru supports two distinct features that augment the different steps in the application development cycle:

  • Amazon CodeGuru Reviewer – Uses ML to identify critical issues and hard-to-find bugs during application development to improve code quality.
  • Amazon CodeGuru Profiler – Helps developers find an application’s most expensive lines of code and provides specific visualizations and recommendations on how to improve code to save money.

In this post, we discuss how an internal development team at Amazon used CodeGuru Profiler to improve their code quality and how this helped improve their customer experience.

Why should I use Amazon CodeGuru Profiler?

CodeGuru Profiler identifies the most expensive lines of code and recommends ways to fix them to reduce CPU utilization, cut compute costs, and improve application performance. Profiler continuously analyzes application CPU utilization and latency characteristics to show where you’re spending the most cycles or time in the application. Profiler presents analysis for CPU utilization and latency characteristics in an interactive flame graph that helps you visually understand which code paths consume the most resources, verify that your application is performing as expected, and uncover areas that can be optimized further.

Also, CodeGuru Profiler automatically identifies anomalies and performance issues in the application and provides intelligent recommendations on how to remediate them. These recommendations help identify and optimize resource-intensive methods within the code, reduce the cost of infrastructure, reduce latency, and improve overall end-user experience. CodeGuru Profiler continuously analyzes application profiles in real time and detects anomalies in the application’s methods. Each anomaly is tracked in the recommendation report, and you can see the time series of how the method’s latency behaves over time with anomalies clearly highlighted. You can also configure Amazon Simple Notification Service (Amazon SNS) to send notifications when a new anomaly is detected.

Problem

Our team works with enterprise ebook publishers to make books available to anybody, anywhere. Our systems must be reliable, and performance and scalability are frequent discussion topics. We consistently operate on extremely large datasets that are the backbone of publishers’ catalogs. When our metrics indicate performance issues, we take them very seriously.

Around the spring of 2020, we noticed increasing performance issues in one of our systems that is tasked with aggregating large amounts of data and transforming it into a downloadable Excel spreadsheet. This export process is critical to our publishers because it allows them to analyze and adjust their entire catalog of books in a manual or automated fashion. Over time, our export system showed signs of poor performance when given large datasets. On extremely large datasets, we began to see consistent failures and timeout errors. To avoid losing more customer trust, we prioritized an investigation into the performance of this system.

Solution

As a first step, we looked at standard JVM metrics through various Java Management Extensions (JMX) tools. The JMX tools gave us a glimpse into the memory profile of our system, detailing what types of objects were allocated, how big they were, and how long they were allocated for. This showed typical values that you expect from a system churning through hundreds of gigabytes of data per hour. There were no signs of an originally suspected memory leak, or any type of resource mismanagement. After finding no helpful data regarding our performance issues, we turned to CodeGuru Profiler for a better look inside the system.

Given the minimal package size of CodeGuru Profiler and its impressively low performance overhead, the decision was easy, and we were able to deploy it onto all of our hosts in under an hour. We were excited to see profiler data within 10 minutes, and decided to let it run for a few days so that it could characterize the application’s performance further and build a comprehensive overview of thread data from jobs created by customers in various marketplaces and time zones. After 2 days, we checked our flame graph on the CodeGuru Profiler dashboard, and the problem was painfully obvious. Nearly 90% of the total time was spent running a specific set of data transforms. CodeGuru Profiler even told us that this specific transform set was costing us hundreds of unnecessary dollars per year. We deployed a small code change to log the runtime of each transform in the set identified by CodeGuru Profiler.
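To illustrate the kind of per-transform runtime logging described above, here is a minimal sketch in Java. It is not our production code: the class name TransformTimer, the method runWithTiming, and the sample transforms are all hypothetical, and it assumes the transforms can be modeled as an ordered set of named functions applied one after another.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical helper: times each named transform as it runs.
public class TransformTimer {

    /**
     * Applies each transform in iteration order and adds its wall-clock
     * time (in nanoseconds) to elapsedNanos under the transform's name.
     */
    public static <T> T runWithTiming(T input,
                                      Map<String, UnaryOperator<T>> transforms,
                                      Map<String, Long> elapsedNanos) {
        T current = input;
        for (Map.Entry<String, UnaryOperator<T>> entry : transforms.entrySet()) {
            long start = System.nanoTime();
            current = entry.getValue().apply(current);
            long elapsed = System.nanoTime() - start;
            // Accumulate so repeated runs sum up per transform.
            elapsedNanos.merge(entry.getKey(), elapsed, Long::sum);
        }
        return current;
    }

    public static void main(String[] args) {
        // Illustrative transforms only; a LinkedHashMap keeps them in insertion order.
        Map<String, UnaryOperator<List<String>>> transforms = new LinkedHashMap<>();
        transforms.put("trimTitles",
                rows -> rows.stream().map(title -> title.trim()).collect(Collectors.toList()));
        transforms.put("upperCaseTitles",
                rows -> rows.stream().map(title -> title.toUpperCase()).collect(Collectors.toList()));

        Map<String, Long> elapsedNanos = new LinkedHashMap<>();
        List<String> result = runWithTiming(Arrays.asList(" dune ", " emma "), transforms, elapsedNanos);

        System.out.println("Result: " + result);
        elapsedNanos.forEach((name, nanos) ->
                System.out.println(name + " took " + nanos + " ns"));
    }
}
```

Logging a per-transform total like this narrows a hot region of the flame graph down to a single named step, which is the level of detail needed to decide what to rewrite.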

The culprit immediately revealed itself as a legacy transform hogging over 75% of the total CPU time. After digging into this transform’s code, we found complex, repeated iterations written in a way that didn’t meet our coding standards. We rewrote this transform using Java, confirmed its accuracy, and deployed to prod. The new transform kicked into high gear after the deployment, and we could see our flame graph transform. The peaks where the old transform used to be began to shrink. During final production testing, we could tell that it even felt faster.

We waited a few weeks to gather large amounts of data to accurately determine the system’s performance enhancement. When we analyzed the runtime before and after the new transform deployment, we found that the transform had improved by 1600 times, driving its impact on CPU utilization to near-zero, ultimately allowing us to increase the work on these CPUs by 4 times.

The following graph shows the week prior to and after the deployment. The purple data points are the week leading up to the deployment, and the blue data points are the week immediately following it. The speedup can be calculated by dividing the purple average by the blue average (3,454,099/845,340).
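As a quick check of that arithmetic, the ratio works out to roughly 4.09, in line with the fourfold increase mentioned earlier. The small Java sketch below reproduces the calculation; the class and method names are hypothetical, and the two averages are treated as plain numbers in whatever unit the graph uses.

```java
// Hypothetical helper for the before/after comparison.
public class ExportSpeedup {

    /** Returns how many times faster the "after" average is than the "before" average. */
    public static double speedup(double averageBefore, double averageAfter) {
        if (averageAfter <= 0) {
            throw new IllegalArgumentException("averageAfter must be positive");
        }
        return averageBefore / averageAfter;
    }

    public static void main(String[] args) {
        // Averages taken from the text: purple (before) and blue (after).
        double ratio = speedup(3_454_099, 845_340);
        System.out.println("Speedup: " + ratio);
    }
}
```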

 

Figure 1 – Average Export Time

This significant improvement was even noticed by our customers without us telling them: they came to us elated that their large exports finished long before they expected them to. Our decision to enlist the help of CodeGuru Profiler ultimately brought us straight to the source of the problem with only a little bit of work on our end, allowing us to visually debug an otherwise invisible bug. With nothing more than a quick configuration change, we were quickly on our way to fixing current and future system-critical issues.

The following graph shows the average time spent processing data before and after the deployment (marked in red). Large spikes are anomalous artifacts of a p100 metric.

 

Figure 2 – Average processing time

Conclusion

After learning of a performance bug plaguing our export system, we were initially concerned with the inherent complexity and uncertain nature of tracking it down. After exploring the debugging solutions AWS offers, we were easily convinced to give CodeGuru a shot. With the low profile and minimal overhead of CodeGuru Profiler, the benefits far outweighed the cost. After a quick deployment, we were on our way to a solution faster than we realized. After letting our thread metrics soak, we could easily identify the problematic code, deploy a fix, and maintain customer trust.

Not only did we find CodeGuru Profiler helpful in this scenario, we’re also using it as a debugging tool that’s constantly running, collecting data, and preparing us for possible future performance degradations. Amazon CodeGuru is a lightweight debugging tool that can provide you with continuous feedback on improving application performance across code iterations.