Amazon Redshift ML allows data analysts, developers, and data scientists to train machine learning (ML) models using SQL. In previous posts, we demonstrated how customers can use the automatic model training capability of Amazon Redshift to train their classification and regression models. Redshift ML provides several capabilities for data scientists. It allows you to create a model using SQL and specify your algorithm as XGBoost. It also lets you bring your pre-trained XGBoost model into Amazon Redshift for local inference. You can let users remotely invoke any model deployed in Amazon SageMaker for inference with SQL.
In this post, we illustrate how data scientists can train models using the XGBoost algorithm. Specifically, we discuss how you can use Redshift ML to train ML models with the
CREATE MODEL command by providing advanced parameters such as preprocessors, problem type, and hyperparameters.
In the post Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML, we reviewed the benefits of Redshift ML and how it simplifies your ML pipeline without the complexity of exporting your data from the data warehouse for use with ML. You don’t have to worry about the governance of data that you export from your data warehouse.
To get started, we need an Amazon Redshift cluster with the ML feature enabled. Redshift ML is generally available, and you can use it in a cluster with the latest maintenance version. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.
The XGBoost algorithm (eXtreme Gradient Boosting) is an optimized open-source implementation of the gradient boosted trees algorithm. XGBoost is designed from the ground up to handle many data science problems in a highly efficient, flexible, portable, and accurate way. XGBoost can be used for regression, binary classification, multi-class classification, and ranking problems. For details on XGBoost and SageMaker, see Introducing the open-source Amazon SageMaker XGBoost algorithm container.
The Redshift ML
CREATE MODEL with
AUTO OFF option currently supports only XGBoost as the
MODEL_TYPE. As a data scientist or ML expert, you can provide relevant information such as objective, preprocessors, and hyperparameters as part of the
CREATE MODEL command based on your use case.
For this post, we use the banknote authentication dataset, which is a classic binary classification problem to predict whether a given banknote is genuine or forged. With analytical methods, you can use the Intaglio structures of circulating banknotes for authentication. Intaglio is a special printing technique that adds a unique structure to banknotes. Image recognition and the Wavelet Transform tool were used to obtain spectral features like variance, skewness, kurtosis, and entropy (of an image) from banknote specimens used as the input parameters for the model. Based on these measurements, you can employ ML methodologies to predict an original note from a forged one.
The following screenshot shows a sample dataset for this use case. The last column,
class, is the one we’re going to predict. Zero (0) corresponds to a genuine banknote and one (1) is counterfeit.
Setup and data preparation
We split the dataset into two sets (80% and 20%) to use for training and testing purposes. To get arbitrariness in the dataset, we use a random number when we split the data. The test data is available in the public Amazon Simple Storage Service (Amazon S3) bucket
The following are the DDL and COPY commands for both training and test datasets. If you prefer, you can also keep the dataset in a single table and use a where clause in the SQL statement to split the training and test data. For simple illustration purposes, we divide the dataset to be used in two different tables. You can use the Amazon Redshift Query Editor or your preferred SQL editor to run these SQL statements.
For loading data, use the appropriate AWS Identity and Access Management (IAM) role for your Amazon Redshift cluster.
Create the ML Model
Now that we have set up the data , let’s create the model in Redshift ML with
AUTO OFF and XGBoost as the model type using the
CREATE MODEL command:
We specified the
class column as the target (label) that we want to predict, and specified
func_model_banknoteauthentication_xgboost_binary as the function.
Make the appropriate changes in the
CREATE MODEL command to specify the
S3_BUCKET . Refer to the previous posts or the documentation on the requirements for the IAM role. The S3 bucket specified is used for exporting the training dataset and storing other ML-related artifacts.
We use the table
banknoteauthentication_train for training data, and the
MODEL_TYPE is set to
CREATE MODEL command is asynchronous, and the model creation happens in the background after you run the command. It may take 8–10 minutes for the model to be trained and ready; and you can monitor the progress using the
SHOW MODEL model_name; command:
Objectives and hyperparameters
As a data scientist, you might want to specify training objectives and hyperparameters for tuning your model. Let’s discuss briefly how you can use
HYPERPARAMETERS options in the
CREATE MODEL with
An objective in any ML problem is a measure of the quality of the solution. An objective of a problem aims to either minimize or maximize the function specified. For example,
reg:squarederror is used as the objective for regression problems, and the smaller the value (closer to zero) this function is, the better the accuracy of the model.
Redshift ML currently supports the following
OBJECTIVE options that go along with XGBoost:
multi:softmax. For more information, see Learning Task Parameters in the XGBoost documentation.
Hyperparameters are problem- and algorithm-specific properties that govern and control the behavior of the training and can have a direct impact on the efficiency of the model. If no hyperparameters are specified, the defaults for XGBoost are used. The values should be enclosed in single quotes. For our hyperparameters,
NUM_ROUND (the number of rounds to run the training) is set to 100. For a complete list of supported hyperparameters, see CREATE XGBoost models with AUTO OFF.
The following code is a representative output from
show model after the
create model command starts. It reflects the inputs presented in the
create model command and more information like model state (
FAILED) and the max runtime to finish. Leaving the model to run to completion with the default time allows you to tune the model to yield better accuracy.
create model command is complete, the
show model command output looks like the following code. The model state is now
READY and has additional information like
Estimated Cost and
train:error. For models created with Amazon SageMaker Autopilot (
AUTO ON), we see
validation as one metric that gives the model’s accuracy, whereas
AUTO OFF with XGBoost provides
train:error, which is a measure of accuracy. In the following code, the value
0.000000 indicates the model is close to 100% accurate.
For binary and multi-class classification problems, we compute the accuracy as the model metric. Accuracy can be calculated based on the following:
Let’s apply the preceding code to our use case to find the accuracy of the model. We use the
banknoteauthentication_test table to test the accuracy, and use the newly created function
func_model_banknoteauthentication_xgboost_binary to predict and take the columns other than the target and label as the input.
The inference query output (0.9923 *100 = 99.23 %) matches the output from the
show model command.
Let’s run the prediction query on the
banknoteauthentication_test to get the count of original vs. counterfeit banknotes:
Please check the Redshift ML regression blog for troubleshooting tips.
Redshift ML provides an easy and seamless platform for database users to create, train, and tune models using the SQL interface. This post showed how data scientists can train an XGBoost model using Redshift ML with the
CREATE MODEL command and use the model for inference using SQL.
About the Authors
Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.
Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.
Jiayuan Chen is a Software Development Engineer at AWS. He is passionate about designing and building data-intensive applications, and has been working in the areas of data lake, query engine, ingestion, and analytics. He keeps up with latest technologies and innovates things that spark joy.
Lokesh Gupta is a Software Development Manager at AWS and manages the Machine Learning Compiler Technologies team. He leads the effort to optimize ML models for best inference performance on the Cloud and Edge targets.