Amazon Redshift, a fast, fully managed, widely used cloud data warehouse, natively integrates with Amazon SageMaker for machine learning (ML). Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train ML models, which can then be used to generate insights for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. In a previous post, we covered how Amazon Redshift ML allows you to use your data in Amazon Redshift with SageMaker, a fully managed ML service, without requiring you to become an expert in ML. We also discussed how Amazon Redshift ML enables ML experts to create XGBoost or MLP models in an earlier post. Additionally, Amazon Redshift ML allows data scientists to either import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.

This post shows how you can enable your data warehouse users to use SQL to invoke a remote SageMaker endpoint for prediction. We first train and deploy a Random Cut Forest model in SageMaker, and demonstrate how you can create a model with SQL to invoke that SageMaker predictions remotely. Then, we show how end users can invoke the model.

Prerequisites

To get started, we need an Amazon Redshift cluster with the Amazon Redshift ML feature enabled. For an introduction to Amazon Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

You also have to make sure that the SageMaker model is deployed and you have the endpoint. You can use the following AWS CloudFormation template to provision all the required resources in your AWS accounts automatically.

Solution overview

Amazon Redshift ML supports text and CSV inference formats. For more information about various SageMaker algorithms and their inference formats, see Random Cut Forest (RCF) Algorithm.

Amazon SageMaker Random Cut Forest (RCF) is an algorithm designed to detect anomalous data points within a dataset. Examples of anomalies that are important to detect include when website activity uncharacteristically spikes, when temperature data diverges from a periodic behavior, or when changes to public transit ridership reflect the occurrence of a special event.

In this post, we use the SageMaker RCF algorithm to train an RCF model using the Notebook generated by the CloudFormation template on the Numenta Anomaly Benchmark (NAB) NYC Taxi dataset.

We downloaded the data and stored it in an Amazon Simple Storage Service (Amazon S3) bucket. The data consists of the number of New York City taxi passengers over the course of 6 months aggregated into 30-minute buckets. We naturally expect to find anomalous events occurring during the NYC marathon, Thanksgiving, Christmas, New Year’s Day, and on the day of a snowstorm.

We then use this model to predict anomalous events by generating an anomaly score for each data point.

The following figure illustrates how we use Amazon Redshift ML to create a model using the SageMaker endpoint.

1 5967 Model

Deploy the model

To deploy the model, go to the SageMaker console and open the notebook that was created by the CloudFormation template.

2 5967 Notebook instances

Then choose bring-your-own-model-remote-inference.ipynb.

3 5967 jupyter

Set up parameters as shown in the following screenshot and then run all cells.

4 5967 bring your own model

Get the SageMaker model endpoint

On the Amazon SageMaker console, under Inference in the navigation pane, choose Endpoints to find your model name. You use this when you create the remote inference model in Amazon Redshift.

5 5967

Prepare data to create a remote inference model using Amazon Redshift ML

Create the schema and load the data in Amazon Redshift using the following SQL:

DROP TABLE IF EXISTS public.rcf_taxi_data CASCADE;
CREATE TABLE public.rcf_taxi_data
(
ride_timestamp timestamp,
nbr_passengers int
);
COPY public.rcf_taxi_data
FROM 's3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi/NAB_nyc_taxi.csv'
iam_role 'arn:aws:iam:::<accountid>:role/RedshiftML' ignoreheader 1 csv delimiter ',';

Amazon Redshift now supports attaching the default IAM role. If you have enabled the default IAM role in your cluster, you can use the default IAM role as follows.

COPY public.rcf_taxi_data
FROM 's3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi/NAB_nyc_taxi.csv'
iam_role default ignoreheader 1 csv delimiter ',';

You can use the Amazon Redshift query editor v2 to run these commands.

6 5967 redshift query

Create a model

Create a model in Amazon Redshift ML using the SageMaker endpoint you previously captured:

CREATE MODEL public.remote_random_cut_forest
FUNCTION remote_fn_rcf(int)
RETURNS decimal(10,6)
SAGEMAKER 'randomcutforest-xxxxxxxxx'
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML';

Check model status

You can use the show model command to view the status of the model:

show model public.remote_random_cut_forest

You get output like the following screenshot, which shows the endpoint and function name.

7 5967

Compute anomaly scores across the entire taxi dataset

Now, run the inference query using the function name from the create model statement:

select ride_timestamp, nbr_passengers, public.remote_fn_rcf(nbr_passengers) as score
from public.rcf_taxi_data;

The following screenshot shows our results.

8 5967 query editor v2

Now that we have our anomaly scores, we need to check for higher-than-normal anomalies.

Amazon Redshift ML has batching optimizations to minimize the communication cost with SageMaker and offers high-performance remote inference.

Check for high anomalies

The following code runs a query for any data points with scores greater than three standard deviations (approximately 99.9th percentile) from the mean score:

with score_cutoff as
(select stddev(public.remote_fn_rcf(nbr_passengers)) as std, avg(public.remote_fn_rcf(nbr_passengers)) as mean, ( mean + 3 * std ) as score_cutoff_value
from public.rcf_taxi_data) select ride_timestamp, nbr_passengers, public.remote_fn_rcf(nbr_passengers) as score
from public.rcf_taxi_data
where score > (select score_cutoff_value from score_cutoff)
order by 2 desc;

The data in the following screenshot shows that the biggest spike in ridership occurs on November 2, 2014, which was the annual NYC marathon. We also see spikes on Labor Day weekend, New Year’s Day and the July 4th holiday weekend.

9 5967

Conclusion

In this post, we used SageMaker Random Cut Forest to detect anomalous data points in a taxi ridership dataset. In this data, the anomalies occurred when ridership was uncharacteristically high or low. However, the RCF algorithm is also capable of detecting when, for example, data breaks periodicity or uncharacteristically changes global behavior.

We then used Amazon Redshift ML to demonstrate how you can make inferences on unsupervised algorithms (such as Random Cut Forest). This allows you to democratize ML by making predictions with Amazon Redshift SQL commands.

For more information about building different models with Amazon Redshift ML see the Amazon Redshift ML documentation.


About the Authors

Phil BatesPhil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Debu PandaDebu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.

Nikos KoulourisNikos Koulouris is a Software Development Engineer at AWS. He received his PhD from University of California, San Diego and he has been working in the areas of databases and analytics.

Murali 100Murali Narayanaswamy is a principal machine learning scientist in AWS. He received his PhD from Carnegie Mellon University and works at the intersection of ML, AI, optimization, learning and inference to combat uncertainty in real-world applications including personalization, forecasting, supply chains and large scale systems.