Amazon Customer Service solves exciting and challenging customer care problems for Amazon.com, the world’s largest online retailer. In 2021, the Amazon Customer Service Technology team upgraded its dense-compute nodes (dc2.8xlarge) to the Amazon Redshift RA3 instance family (ra3.16xlarge). Moving to the most advanced Amazon Redshift architecture enabled the team to reduce its infrastructure costs, improve the performance of queries, optimize its compute and storage capacity planning processes, and accelerate the performance of analytical dashboards for making business-critical decisions.

The RA3 project was a major strategic move for the Amazon Customer Service Technology team. The team operated 8 DC2 nodes, but its workload required only 6 nodes’ worth of computing power and 10–12 nodes’ worth of storage capacity. Upgrading to 3 RA3 nodes allowed the team to not only save up to 55% on Amazon Redshift operating costs per year, but also improve the load times of most dashboards by up to 47% and the data retrieval performance of most queries by up to 25%. The new nodes enable the team to plan and scale storage resources flexibly without having to purchase additional compute, so the team can continue to innovate to the highest standards for its customers.

General challenges

As the Amazon Customer Service data warehouse continues to grow in volume, additional storage is required to store the new information. In fact, Customer Service’s storage capacity needs have grown faster than computing power needs. However, Customer Service has been using dense-compute nodes over dense-storage nodes to support real-time analytical dashboards that require high-speed retrievals using NVMe-based SSDs offered exclusively by the dense-compute nodes.

The previous generation of Amazon Redshift instances didn’t allow you to scale compute and storage resources separately because compute and storage were tightly coupled in a node. As a result, more compute had to be purchased whenever additional storage was required, which increased the operating costs of supporting mission-critical applications. In addition to the pain of purchasing extra compute, the team faced the following challenges:

  • High disk usage and high storage utilization – The storage utilization of the Amazon Redshift cluster crossed 90% because each dc2.8xlarge node had limited SSD space (2.56 TB). This caused high disk usage per node whenever disk-intensive queries were run, even though these queries were already optimized. A healthy Amazon Redshift cluster should not exceed 80% storage utilization, but this threshold had been crossed.
  • Inability to add more storage without resizing the cluster – Because each dc2.8xlarge node had a fixed storage capacity, the only way to add storage was to resize the cluster. The team frequently needed about three times the usual amount of ad hoc storage to onboard large tables to the cluster, and it became difficult to add these tables with only 8 DC2 nodes. An elastic resize of the DC2 instance family can at most double the number of compute nodes, and a classic resize can cause operational issues because the cluster is put into read-only mode.
  • Difficulty controlling the cluster’s operating cost – The workload required only 6 nodes’ worth of computing power, whereas storage required 10–12 nodes’ worth of capacity, so there was no need to purchase additional compute for the growing storage needs. The CPU utilization was low-to-moderate, which made it unjustifiable to purchase more compute just to obtain additional storage. Because compute was expensive, it was difficult to control the operating cost of the cluster.
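
The mismatch between compute and storage needs described above can be checked with some quick arithmetic. The following is a minimal sketch, not the team’s actual tooling: the node size (2.56 TB), the node count (8), and the 80% utilization threshold come from the text, while the 19 TB of stored data and the function names are hypothetical and chosen only for illustration.

```python
import math

DC2_8XLARGE_STORAGE_TB = 2.56  # SSD capacity per dc2.8xlarge node (from the text)
HEALTHY_UTILIZATION = 0.80     # upper bound on healthy storage utilization (from the text)


def storage_utilization(used_tb: float, node_count: int) -> float:
    """Return the fraction of total cluster storage that is in use."""
    return used_tb / (node_count * DC2_8XLARGE_STORAGE_TB)


def nodes_needed_for_storage(used_tb: float) -> int:
    """Return the fewest DC2 nodes that keep utilization at or below the threshold."""
    usable_tb_per_node = DC2_8XLARGE_STORAGE_TB * HEALTHY_UTILIZATION
    return math.ceil(used_tb / usable_tb_per_node)


# Hypothetical example: 19 TB of data stored on the 8-node cluster.
used_tb = 19.0
print(f"utilization on 8 nodes: {storage_utilization(used_tb, 8):.0%}")
print(f"nodes needed for storage alone: {nodes_needed_for_storage(used_tb)}")
```

Under these assumed numbers, utilization on 8 nodes is above 90%, and storage alone would call for 10 nodes even though compute needs only 6. On DC2, closing that gap means paying for compute that the workload doesn’t use.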

In December 2019, AWS introduced the RA3 nodes, a new generation of compute nodes powered by Amazon EC2 R5 Nitro instances and supplemented by a managed-storage model that lets you optimize your compute and storage resources independently. The RA3 nodes include several architectural improvements. They manage your storage using the AWS Nitro System in conjunction with Amazon Simple Storage Service (Amazon S3), and they work with the Advanced Query Accelerator (AQUA), a distributed, hardware-accelerated high-speed cache that enables Amazon Redshift to run 3–10 times faster for certain types of queries. The new nodes are optimized for extract, transform, and load (ETL) workloads as well as for operational analytics, and they are backward-compatible with all existing Amazon Redshift features.

The RA3 nodes helped the Amazon Customer Service Technology team optimize its compute and storage capacity planning processes independently. For every compute node in production, the cluster was allotted a storage quota in Amazon S3 called Redshift managed storage (RMS), and the business was charged on an hourly basis for the portion of the quota that it used, instead of having to purchase the entire quota. A permanent copy of the data resided in the RMS, and the records required by queries were copied to the SSD-based caches of the local nodes over an ultra-high-speed network. Using the RA3 nodes enabled the team to improve its Amazon Redshift operating efficiency and reduce its cost of operations. It also allowed the team to deploy some of the new Amazon Redshift features to production, helping the business perform better than before.

Design considerations

The Amazon Customer Service Technology team considered the following design factors while moving to the RA3 instance family:

  • Selecting the correct RA3 instance type – Consider the cluster’s bandwidth and operating costs over most of the other AWS recommendations. If the operating costs are equal and the cluster has steady-compute requirements, select the larger instance type ra3.16xlarge over the smaller ra3.4xlarge. The larger instance provides four times the I/O bandwidth of the smaller instance (8.0 GB/s versus 2.0 GB/s). The higher bandwidth makes sure that the Amazon Redshift cluster hydrates its data more quickly. However, if the operating costs are a concern, test the workload with the smaller instance before adopting it, to ensure that the cluster doesn’t face performance issues pertaining to the I/O bandwidth.
  • Ascertaining the type of resize for the migration – A classic resize resets the number of slices to its default value for any instance, thereby affecting the total number of slices in the cluster, whereas an elastic resize redistributes the existing slices across the new nodes, without affecting their total number. Classic resizes put the cluster into read-only mode, which causes the writes to stop in the process, but elastic resizes don’t put the cluster into read-only mode when it’s modified. Moreover, classic resizes take longer to complete, typically 2 hours to 2 days, whereas elastic resizes are shorter and take only 10–20 minutes. We recommend using an elastic resize over a classic resize for performing the upgrade.
  • Deciding the correct node-count ratio – Unlike the existing generation of DC2 nodes, the RA3 nodes support elastic resizes of up to four times the current number of compute nodes. In particular, we found that a 3:1 node-count ratio (also called a migration ratio) allowed us to optimize our operating cost for our performance needs, but we recommend that you consult the RA3 sizing guidelines when choosing the node-count ratio for your own performance needs.
  • Choosing the correct workload management configuration – The RA3 nodes have improved compute and storage specifications because their technology is more advanced than that of the DC2 nodes. Choosing the right automatic workload management (WLM) configuration with two or more machine learning (ML) queues can enable a business to harness the true potential of the new architecture. With the automatic WLM, memory allocation and concurrency are automatically decided by the Amazon Redshift cluster using ML algorithms.
  • Setting the correct query monitoring rules – The query monitoring rules (QMRs) can help the Amazon Redshift cluster mitigate the cost of ill-performing queries that over-utilize its resources, such as CPU and cache, memory, disk, and Redshift Spectrum. In the new architecture, any query that spills data to disk actually spills data to the RMS in Amazon S3. Because that spill can degrade query performance, it’s essential to limit the data spill to no more than 1 TB per ra3.4xlarge node and 4 TB per ra3.16xlarge node. For example, a 3-node ra3.16xlarge cluster should not spill more than 12 TB of data to the RMS. To enforce such a rule, set up a QMR that checks how much data each query spills to the RMS and acts when the spill exceeds the QMR’s threshold. Any query that breaks the QMRs should be considered for performance tuning.
  • Optimizing the number of manual snapshots – Scheduled automated snapshots are stored in the backup storage system free of cost. Although you’re not charged a monthly fee for storing them, you are charged for copying the automated snapshots to manual snapshots or for creating new manual snapshots. Copy an automated snapshot to a manual snapshot only when you need to retain the automated snapshot for more than 35 days, and try to optimize the number of manual snapshots (and their retention periods) to minimize the cost of storing them.
  • Performing pre- and post-upgrade activities – Take one manual snapshot of the cluster before the migration and tag it for future reference. This snapshot serves as an artifact for troubleshooting any issues that may arise in the future. Provide this snapshot to AWS support via the Amazon Redshift service console, if requested. Drop all unwanted objects from the Amazon Redshift cluster, because they occupy RMS space and increase the cost of storing the data. Perform a vacuum on all of the tables to ensure that they release their reclaimable space. Run an analyze on all of the tables after the migration to refresh their statistics for the query planner.
  • Implementing the Redshift advisor’s recommendations – Carefully check the recommendations of Amazon Redshift Advisor after performing the migration. Specifically, check the recommendations that have a high-to-medium impact because these can reveal areas for improvement. Examples include distribution key and sort key changes that control data skew and I/O skew or improve the performance of tables, as well as improvements to the Amazon S3 copy operations.
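
The spill limits in the query monitoring bullet above lend themselves to a small worked example. The following is an illustrative sketch only: the per-node limits (1 TB and 4 TB) and the 3-node example come from the text, whereas the function names, the query names, and the observed spill figures are hypothetical. It doesn’t configure a QMR, which is defined in the cluster’s WLM configuration; it only shows the arithmetic behind choosing a threshold and flagging queries for tuning.

```python
from typing import Dict, List

# Per-node spill limits in TB, as stated in the text.
SPILL_LIMIT_TB_PER_NODE = {"ra3.4xlarge": 1, "ra3.16xlarge": 4}


def cluster_spill_limit_tb(node_type: str, node_count: int) -> int:
    """Return the cluster-wide spill limit in TB for the given node type and count."""
    return SPILL_LIMIT_TB_PER_NODE[node_type] * node_count


def queries_to_tune(spill_tb_by_query: Dict[str, float], limit_tb: float) -> List[str]:
    """Return the names of queries whose spill exceeds the limit, sorted by name."""
    return sorted(name for name, spill in spill_tb_by_query.items() if spill > limit_tb)


limit_tb = cluster_spill_limit_tb("ra3.16xlarge", 3)

# Hypothetical per-query spill measurements in TB.
observed = {"daily_rollup": 0.4, "wide_join": 13.5, "backfill": 12.0}

print(f"cluster spill limit: {limit_tb} TB")
print(f"candidates for tuning: {queries_to_tune(observed, limit_tb)}")
```

For the 3-node ra3.16xlarge cluster, the limit works out to the 12 TB quoted in the text. In this made-up sample, only the query that spills more than 12 TB is flagged.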

Benefits

The Amazon Customer Service Technology team realized several benefits by adopting the ra3.16xlarge instance:

  • More than 90% increase in total storage unit capacity – The storage capacity of the Amazon Redshift cluster is now approximately 18 times what it was earlier. This reduced the engineers’ burden of housekeeping the cluster and continuously monitoring its disk usage via Amazon CloudWatch.
  • More than 40% reduction in unwanted compute capacity – With 3 RA3 nodes, the new Amazon Redshift cluster has an approximately 38 times larger SSD-based total cache and 43% smaller total number of vCPUs, helping the team to not only support faster disk-based I/O but also save on the undesirable cost of compute.
  • Dramatically faster disk read and write operations – The new RA3 nodes have faster SSDs than the DC2 nodes, allowing the new cluster to be faster than its predecessor. Most of the queries have improved in performance up to 25% without the need for further tuning, thereby allowing the team to be more efficient at operational analytics.
  • Separation of compute and storage unit capacities and costs – Because a permanent copy of the data no longer exists in the compute nodes, they’re used exclusively for computing the results of queries. The cluster’s storage unit now resides in the RMS, backed by Amazon S3, and is billed at a monthly rate, depending upon its hourly usage. The new cluster can scale compute or storage resources independently, thereby helping the business improve at capacity planning and control the total cost of operations.
  • Cost savings from moving Redshift Spectrum tables to the RMS – The team had offloaded many frequently accessed tables to Amazon S3 due to the fixed storage capacity of the DC2 nodes and was accessing the data using Redshift Spectrum, incurring additional scanning charges in the process. Those tables have now been moved to the RMS, helping the team avoid over 40,000 Redshift Spectrum scans and therefore save money.
  • Slight increase in the I/O bandwidth of the cluster – The I/O bandwidth of an Amazon Redshift cluster plays a pivotal role in determining the total stress that it can sustain without impacting performance. The ra3.16xlarge instance has a higher I/O bandwidth (6.67% more) than the dc2.8xlarge instance, making the new cluster more resilient to stress than its predecessor. The business now has a more fault-tolerant Amazon Redshift cluster than before.
  • More efficient data syncs with some internal services – The RA3 nodes use large high-speed caches to reduce the time taken by internal syncs by up to 45%. This means that the data is readily available to stakeholders whenever required, making them more effective and efficient at work.
  • Faster analytical dashboards than before – Most analytical dashboards load up to 47% faster after moving to the RA3 instance family. These dashboards are used extensively by the business for data-driven decision-making. The faster load time has a positive impact on the ability of the business to make critical or time-sensitive decisions.
  • High availability architecture for Multi-AZ deployments – The cross-AZ cluster relocation feature allows the new Amazon Redshift cluster to switch over to another Availability Zone when a configuration issue occurs in its current Availability Zone. This enables the team to develop a business continuity plan for maximizing the availability of the cluster by deploying an automatic failover solution that can provide 99.99% Amazon Redshift availability using CloudWatch and AWS Lambda.
  • Support for Amazon Redshift cross-database queries and data sharing – The new RA3 nodes allow you to perform joins on tables across different databases and different clusters using the cross-database queries and the data sharing features, thereby enabling the team to join and query datasets among multiple clusters free of cost.
  • Unlimited code cache and improved cold query performance – With the improved cold query performance update, the Amazon Redshift cluster can process queries much faster when they need to be compiled. The unlimited code cache for the leader node can store compiled objects to increase the cache hits from 99.60% to 99.95%. This has dramatically improved the performance of business-critical queries (up to two times faster) for the team free of cost.
  • More environment-friendly Amazon Redshift than before – With the separation of storage and compute, capacity optimization has improved by 90%. This has a positive impact on the environment and allows Amazon.com to build a more sustainable business.
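
Two of the figures in the list above, the 43% reduction in vCPUs and the 6.67% increase in I/O bandwidth, can be reproduced with a short calculation. This is a rough sketch under stated assumptions: the node counts (8 DC2 and 3 RA3) and the 8.0 GB/s bandwidth of ra3.16xlarge come from the text, but the per-node vCPU counts (32 and 48) and the 7.5 GB/s bandwidth of dc2.8xlarge are assumed from the published node specifications and aren’t stated in this post. The names in the code are illustrative.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


# Per-node specifications. The vCPU counts and the dc2.8xlarge bandwidth
# are assumptions; only the ra3.16xlarge bandwidth appears in the text.
DC2_8XLARGE = {"vcpus": 32, "io_gb_per_s": 7.5}
RA3_16XLARGE = {"vcpus": 48, "io_gb_per_s": 8.0}

old_vcpus = 8 * DC2_8XLARGE["vcpus"]   # 8 DC2 nodes before the migration
new_vcpus = 3 * RA3_16XLARGE["vcpus"]  # 3 RA3 nodes after the migration

vcpu_change = percent_change(old_vcpus, new_vcpus)
io_change = percent_change(DC2_8XLARGE["io_gb_per_s"], RA3_16XLARGE["io_gb_per_s"])

print(f"total vCPUs: {old_vcpus} -> {new_vcpus} ({vcpu_change:.2f}%)")
print(f"per-node I/O bandwidth change: {io_change:.2f}%")
```

With these assumed specifications, the total vCPU count drops by 43.75% and the per-node I/O bandwidth rises by 6.67%. Both results line up with the figures quoted in the list.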

Summary

Amazon Customer Service Technology solves exciting and challenging customer-care problems for Amazon.com, the world’s largest online retailer. The recent switch to RA3 nodes had the following benefits:

  • Reduced the Amazon Redshift total cost of operations by up to 55%
  • Increased the performance of most of the Amazon Redshift queries by up to 25%
  • Controlled the compute and storage capacities and costs, independently
  • Improved the load times of most analytical dashboards by up to 47%

The Amazon Customer Service Technology team continuously develops products for Amazon.com to systematically eliminate defects impacting customers, identify possible issues before they occur, and create self-service and automation to make it easier for customers to interact with Amazon. The Amazon Redshift RA3 nodes empower the team to continue solving challenging customer problems. The team encourages AWS customers to consider moving to the RA3 instance family when storage and compute capacity planning must be optimized independently to achieve a better price-to-performance ratio than the dense-compute nodes offer.


About the Author

Omkar Sunkersett is a Data Engineer at Amazon.com based in Seattle, WA. He builds and manages data-driven solutions for product and operational analytics, working together with a diverse and talented team of scientists, engineers and product managers, in collaboration with other experts of the Amazon Customer Service Technology organization.