Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

Amazon Redshift introduced support for natively storing and processing HyperLogLog (HLL) sketches in October 2020. HyperLogLog is a novel algorithm that efficiently estimates the approximate number of distinct values in a dataset, with an average relative error between 0.01% and 0.6%. An HLL sketch is a construct that encapsulates the information about the distinct values in the dataset. You can generate, persist, and combine HLL sketches. HLL sketches allow you to achieve significant performance benefits for queries that compute approximate cardinality over large datasets, and also reduce I/O and CPU usage.

In this post, I explain how to use HLL sketches for efficient trend analysis against a massive dataset in near-real time.

Challenges

Distinct counting is a commonly used metric in trend analysis to measure popularity or performance. When you compute the distinct count by day, such as how many unique people search for a product or how many unique users watch a movie, you can see whether the trend goes upward or downward and use this insight to drive decisions.

It looks like a simple problem, but the challenge grows quickly as data grows. Computing an exact distinct count can use lots of resources (because each unique value must be stored per grouping criteria) and take a long time even when using a parallelized processing engine. If a filter or GROUP BY condition changes, or new data is added, you have to reprocess all the data to get a correct distinct count.

Solution overview

To solve these challenges, you can use a probabilistic counting algorithm such as HyperLogLog, which makes a small trade-off in accuracy for a large gain in performance and a reduction in resource utilization. When data is massive, a very small difference in the final result doesn’t change the trend, but high performance is essential to enable interactive, near-real-time analysis.

Another appealing feature of HyperLogLog is that the intermediate results are mergeable, which provides great performance and flexibility. You can compute an HLL sketch at the finest grouping granularity desired, such as daily, then merge the daily HLL sketches every 7 days to get the weekly approximate distinct count. You can also apply a filter or group by a higher-level dimension to further aggregate the HLL sketches. By incrementally computing HLL sketches for newly added data, you can compute approximate distinct counts over years of historical data much more quickly, which supports interactive trend analysis. For a time-series dataset that grows rapidly, this algorithm is even more attractive. For more information, see Amazon Redshift announces support for HyperLogLog Sketches.
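To make the mechanics concrete, the following is a minimal, illustrative HyperLogLog sketch in Python. It is not Redshift’s implementation; the register count, hash function, and helper names are all arbitrary choices for the sketch of the idea. It shows both how a sketch estimates a distinct count and why sketches merge cheaply (combining is just an elementwise max over registers):

```python
import hashlib
import math

M, P = 1024, 10  # 2^10 registers; illustrative size, not Redshift's


def _hash64(value):
    # Deterministic 64-bit hash of the value's string form
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")


def make_sketch(values):
    """Build a sketch: each register keeps the max leading-zero rank seen."""
    regs = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> (64 - P)                      # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 54 bits
        rank = (64 - P) - rest.bit_length() + 1  # position of leftmost 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs


def merge(a, b):
    """Combining two sketches is an elementwise max of their registers."""
    return [max(x, y) for x, y in zip(a, b)]


def cardinality(regs):
    """Harmonic-mean estimator with the standard small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # linear counting for small cardinalities
    return raw


# Daily sketches merge into a weekly estimate without rescanning raw data
day1 = make_sketch(range(0, 60_000))
day2 = make_sketch(range(40_000, 100_000))  # overlaps day1
week = merge(day1, day2)
print(round(cardinality(week)))             # close to the true 100,000
```

Because the merge keeps per-register maxima, merging the two daily sketches yields exactly the same registers as sketching the union of both days directly, which is what makes incremental rollups safe.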

The HyperLogLog capability in Amazon Redshift uses bias correction techniques and provides high accuracy with a low memory footprint. The average relative error is between 0.01% and 0.6%, which can meet even critical analysis needs. The following queries get the top 20 products from 100 billion sales records. Let’s compare the results to see the error rate:

SELECT product, Count(DISTINCT order_number)
FROM store_sales
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
-- Runtime: 50 seconds

If we use HLL() instead, the query completes much faster and uses fewer resources:

SELECT product, Hll(order_number)
FROM store_sales
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
-- Runtime: 22 seconds

The following table summarizes our findings.

#  | product | exact_cnt | hll_cnt | error_rate
1  | 751  | 3268708 | 3260132 | 0.00262
2  | 1001 | 3266228 | 3295854 | 0.00907
3  | 501  | 3263204 | 3279679 | 0.00505
4  | 1251 | 3261683 | 3274701 | 0.00399
5  | 1501 | 3257782 | 3295346 | 0.01153
6  | 1505 | 3256431 | 3274511 | 0.00555
7  | 2003 | 3255888 | 3286052 | 0.00926
8  | 1753 | 3252813 | 3252343 | 0.00014
9  | 253  | 3251170 | 3248159 | 0.00093
10 | 503  | 3250369 | 3234022 | 0.00503
11 | 753  | 3249070 | 3254235 | 0.00159
12 | 2251 | 3243943 | 3274322 | 0.00936
13 | 251  | 3242443 | 3214279 | 0.00869
14 | 5    | 3241938 | 3270953 | 0.00895
15 | 1751 | 3240617 | 3260132 | 0.00602
16 | 505  | 3240346 | 3235995 | 0.00134
17 | 1    | 3237963 | 3210493 | 0.00848
18 | 2253 | 3236911 | 3220478 | 0.00508
19 | 3    | 3235889 | 3268534 | 0.01009
20 | 2009 | 3235814 | 3273971 | 0.01179
Average error rate: 0.00623
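As a quick sanity check, the error rates in the table follow directly from the exact and approximate counts (relative error = |exact − hll| / exact). A few lines of Python reproduce the table’s figures:

```python
# (exact_cnt, hll_cnt) pairs taken from the table above
rows = [
    (3268708, 3260132), (3266228, 3295854), (3263204, 3279679),
    (3261683, 3274701), (3257782, 3295346), (3256431, 3274511),
    (3255888, 3286052), (3252813, 3252343), (3251170, 3248159),
    (3250369, 3234022), (3249070, 3254235), (3243943, 3274322),
    (3242443, 3214279), (3241938, 3270953), (3240617, 3260132),
    (3240346, 3235995), (3237963, 3210493), (3236911, 3220478),
    (3235889, 3268534), (3235814, 3273971),
]
errors = [abs(exact - hll) / exact for exact, hll in rows]
print(round(errors[0], 5))                  # 0.00262 for product 751
print(round(sum(errors) / len(errors), 5))  # 0.00623 average
```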

Use case

To illustrate how HLL can speed up trend analysis, I use a simulated dataset containing a sale transaction log for both in-store and online sales, and explain how to achieve near-real-time product trend analysis. The testing dataset is based on a 100 TB TPC-DS dataset, and the test environment is a six-node ra3.16xlarge Amazon Redshift cluster.

The business goal is to stock inventory of only high-quality products so we can optimize our warehousing and supply chain costs. We constantly analyze the product trend, replace unpopular products with new products, move off-season products out, and add seasonal products both in-store and online. In peak times, we want to monitor trends closely to make sure we have enough inventory for the next week, day, or even hour.

With a large customer base of almost 100 million, our sales transaction log is huge. The dataset contains 5 years of customer order data, a total of about 500 billion records. This data sits in our Amazon Simple Storage Service (Amazon S3) data lake.

For trend analysis, we build an HLL sketch from an external table and store sketches in an Amazon Redshift local table. Compared to the raw sales dataset, the HLL sketch is much smaller and grows more slowly. The more repeated values you have, the more space savings you get from an HLL sketch. In this use case, our sales records can grow 10–100 times in size, and raw logs can quickly grow beyond 100 TB, but the sketch table stays at the level of hundreds of gigabytes.
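To get a feel for why sketches stay small, compare the state an exact distinct count must carry (every unique value seen so far) with a sketch’s fixed-size register array. This is a rough, illustrative Python comparison; the register count is arbitrary and Redshift’s internal representation differs:

```python
import sys

# Exact distinct counting must remember every unique value seen so far
exact_state = set(range(1_000_000))

# An HLL sketch is a fixed-size array of small registers, no matter how
# many rows or distinct values it has absorbed
hll_state = [0] * 1024

print(f"exact: {sys.getsizeof(exact_state) / 1e6:.1f} MB of set overhead alone")
print(f"sketch: {sys.getsizeof(hll_state)} bytes")
```

The gap widens as cardinality grows: the set keeps growing with every new unique value, while the sketch never does.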

Let’s assume the sales data comes from different sales channels and is stored in the external database salesdb separately, as summarized in the following table.

Sales Channel | Channel ID | External table
In-Store | 1 | salesdb.instore_sales
Online | 2 | salesdb.online_sales
Seasonal | 3 | salesdb.seasonal_sales

We preprocess the raw sales log to compute HLL sketches at the most granular grouping needed for product trend analysis. The result is stored in the sales_sketch table defined in the following code. We can have multiple sketch columns in the same table. After the one-time preprocessing, the HLL sketches are stored in Amazon Redshift. Further analysis can use the tables with sketch columns as frequently as needed, which significantly reduces the Amazon Redshift Spectrum query cost. Furthermore, queries against the HLL sketch table always get consistently fast performance.

CREATE TABLE sales_sketch AS
SELECT 1 AS channel,
       product,
       d_date,
       Hll_create_sketch(customer) AS hll_customer,
       Hll_create_sketch(order_number) AS hll_order,
       Count(1) AS total,
       Sum(quantity) AS total_quantity,
       Sum((price - cost) * quantity) AS total_profit
FROM salesdb.instore_sales
GROUP BY channel, product, d_date
ORDER BY product, d_date;

Because HLL sketches are mergeable, we can store the sales data from different sales channels together to easily compare channel efficiency:

INSERT INTO sales_sketch
SELECT 2 AS channel,
       product,
       d_date,
       Hll_create_sketch(customer) AS hll_customer,
       Hll_create_sketch(order_number) AS hll_order,
       Count(1) AS total,
       Sum(quantity) AS total_quantity,
       Sum((price - cost) * quantity) AS total_profit
FROM salesdb.online_sales
GROUP BY channel, product, d_date
ORDER BY product, d_date;

With the query scheduler feature in Amazon Redshift, you can easily create a nightly extract, transform, and load (ETL) job to build a sketch for newly arrived data automatically and enjoy the benefit of HLL sketches:

INSERT INTO sales_sketch
SELECT 2 AS channel,
       product,
       d_date,
       Hll_create_sketch(customer) AS hll_customer,
       Hll_create_sketch(order_number) AS hll_order,
       Count(1) AS total,
       Sum(quantity) AS total_quantity,
       Sum((price - cost) * quantity) AS total_profit
FROM salesdb.online_sales
WHERE d_date = DATEADD(day, -1, current_date)
GROUP BY channel, product, d_date
ORDER BY product, d_date;

Now the data is ready to be analyzed. Let’s look at the overall customer trend in the past 5 years. The following query takes less than 5 seconds using HLL sketches, which makes interactive analysis easy:

SELECT d_date,
       Hll_cardinality(Hll_combine(hll_customer)) AS customer,
       Hll_cardinality(Hll_combine(hll_order)) AS orders
FROM sales_sketch
GROUP BY d_date;
-- Runtime: 4.9 seconds

The same analysis using COUNT(DISTINCT) uses more resources and takes several minutes, and its runtime grows linearly as the data grows.

The following visualization of our results shows that sales grew in a steady upward trend, with growth accelerating in the last 2 years.

[Image: trend analysis 1]

Let’s drill down into each sales channel to find out what action triggered it.

As the following results show, the acceleration is due to the online sales channel we added at that time. Adding the online channel was definitely a great strategy.

SELECT d_date,
       Hll_cardinality(Hll_combine(hll_customer)) AS customer,
       Hll_cardinality(Hll_combine(hll_order)) AS orders
FROM sales_sketch
WHERE channel = 1
GROUP BY d_date;
-- Runtime: 2.92 seconds

SELECT d_date,
       Hll_cardinality(Hll_combine(hll_customer)) AS customer,
       Hll_cardinality(Hll_combine(hll_order)) AS orders
FROM sales_sketch
WHERE channel = 2
GROUP BY d_date;
-- Runtime: 2.96 seconds

The following visualization shows the in-store results.

[Image: trend analysis 2]

The following visualization shows the online results.

[Image: trend analysis 3]

With the fine-grained HLL sketch computed, we can filter or group by any combination of date range, sales channel, and product, then merge the HLL sketch to get a final approximate distinct count. Now let’s compare the top 10 popular products per sales channel:

SELECT product, Hll_cardinality(Hll_combine(hll_customer)) AS instore
FROM sales_sketch
WHERE channel = 1
GROUP BY product
ORDER BY instore DESC
LIMIT 10;

SELECT product, Hll_cardinality(Hll_combine(hll_customer)) AS online
FROM sales_sketch
WHERE channel = 2
GROUP BY product
ORDER BY online DESC
LIMIT 10;

Regardless of whether we use a channel filter, all queries consistently complete in under 2 seconds.

The following table shows that the top 10 in-store products are completely different from the online products.

instore | online
1001 | 599
1501 | 749
2003 | 445
501 | 149
1251 | 743
1505 | 899
2251 | 437
2009 | 587
5 | 889
3 | 289

Let’s dig into a few products and compare sales channel performance:

WITH instore AS (
  SELECT d_date, Hll_cardinality(Hll_combine(hll_customer)) AS customer_instore
  FROM sales_sketch
  WHERE product = <product_id> AND channel = 1
  GROUP BY d_date
),
online AS (
  SELECT d_date, Hll_cardinality(Hll_combine(hll_customer)) AS customer_online
  FROM sales_sketch
  WHERE product = <product_id> AND channel = 2
  GROUP BY d_date
)
SELECT instore.d_date, customer_instore, customer_online
FROM instore LEFT OUTER JOIN online ON instore.d_date = online.d_date
ORDER BY instore.d_date;
-- Runtime: 0.5 seconds

The following visualization shows the Product 1 results.

[Image: trend analysis 4]

The following visualization shows the Product 2 results.

[Image: trend analysis 5]

In the preceding example, Product 1 wasn’t as popular with online customers as Product 2. This insight can inform inventory planning at the distribution centers as well as the retail stores to increase sales.

Lastly, let’s analyze seasonal product performance. For seasonal items, the turnover rate is much quicker, so we want to monitor the trend more closely and make adjustments as soon as possible.

The following code monitors the performance of our top seasonal product:

WITH top_1_product AS (
  SELECT product
  FROM sales_sketch
  WHERE channel = 3
  GROUP BY product
  ORDER BY Hll_cardinality(Hll_combine(hll_order)) DESC
  LIMIT 1
)
SELECT d_date, Hll_cardinality(Hll_combine(hll_customer)) AS customer
FROM sales_sketch
WHERE product IN (SELECT product FROM top_1_product)
  AND channel = 3
GROUP BY d_date;
-- Runtime: 2.7 seconds

The following visualization shows our results.

[Image: trend analysis 6]

To get a full picture quickly, we can compute the latest sales data on the fly and merge it with the HLL sketch history to get a complete result:

WITH full_sales AS (
  SELECT product, d_date,
         Hll_create_sketch(customer) AS hll_customer,
         Hll_create_sketch(order_number) AS hll_order
  FROM seasonal_sales
  WHERE d_date = <CURRENT_DATE>
  GROUP BY product, d_date
  UNION ALL
  SELECT product, d_date, hll_customer, hll_order
  FROM sales_sketch
  WHERE channel = 3
)
SELECT product, Hll_cardinality(Hll_combine(hll_customer)) AS seasonal
FROM full_sales
GROUP BY 1
ORDER BY 2 DESC;
-- Runtime: 0.9 seconds

HLL sketches provide flexibility and high performance, enabling interactive trend analysis and near-real-time trend monitoring. For hot data, such as the most recent week, we can even keep hourly granularity. After a month, we can roll the daily sketch data up to monthly granularity to get even better performance.
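This kind of rollup is safe because combining sketches is order-insensitive: conceptually it is an elementwise max over registers, which is commutative, associative, and idempotent. Merging hourly sketches into daily ones and then into a monthly sketch therefore yields exactly the same result as merging all the hourly sketches directly. The following is a small illustrative check in Python, using toy register arrays rather than real sketches:

```python
from functools import reduce
import random

def combine(a, b):
    # Combining sketches conceptually takes the elementwise max of registers
    return [max(x, y) for x, y in zip(a, b)]

random.seed(7)
# 24 toy "hourly sketches", each a small array of register values
hourly = [[random.randint(0, 20) for _ in range(16)] for _ in range(24)]

# Roll up: 24 hourly sketches -> four 6-hour blocks -> one daily sketch
blocks = [reduce(combine, hourly[i:i + 6]) for i in range(0, 24, 6)]
daily_via_blocks = reduce(combine, blocks)

# Merging all hours directly gives the identical registers
daily_direct = reduce(combine, hourly)
assert daily_via_blocks == daily_direct
```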

Conclusion

You can use HyperLogLog for a lot of use cases. For a video demo of another use case, check out Amazon Redshift HyperLogLog Demo.

Evaluate your use cases to see if they can use approximate counts and take advantage of the HLL capability of Amazon Redshift.


About the Author

Juan Yu is an Analytics Specialist Solutions Architect at Amazon Web Services, where she helps customers adopt cloud data warehouses and solve analytic challenges at scale. Prior to AWS, she had fun building and enhancing MPP query engines to improve customer experience on Big Data workloads.


Costas Zarifis is a Software Development Engineer at AWS. He spends most of his time building cutting-edge features for the Redshift cloud infrastructure and he is the lead developer of the HLL type and functionality in Redshift. He is passionate about learning new and exciting technologies, and enjoys working on large scalable systems. Costas holds a Ph.D. from the University of California, San Diego and his interests lie in web and database application frameworks and systems.