Marketing teams often rely on data engineers to provide a consumer dataset that they can use to launch marketing campaigns. This can sometimes cause delays in launching campaigns and consume data engineers’ bandwidth. The campaigns are often launched using complex solutions that are either code heavy or using licensed tools. The processes of both extract, transform, and load (ETL) and launching campaigns need engineers who know coding, take time to build, and require maintenance overtime.

You can now simplify this process by integrating AWS services like Amazon PinPoint, AWS Glue DataBrew, and AWS Lambda. We can use DataBrew (a visual data preparation service) to perform ETL, Amazon PinPoint (an outbound and inbound marketing communications service) to launch campaigns, and Lambda functions to achieve end-to-end automation. This solution helps reduce time to production, can be implemented by less-technical folks because it doesn’t require coding, and has no licensing costs involved.

In this post, we walk through the end-to-end workflow and how to implement this solution.

Solution overview

In this solution, the source datasets are pushed to Amazon Simple Storage Service (Amazon S3) using SFTP (batch data) or Amazon API Gateway (streaming data) services. DataBrew jobs perform data transformations and prepare the data for Amazon PinPoint to launch campaigns. We use Lambda to achieve end-to-end automation, which includes an alert system via Amazon Simple Notification Service (Amazon SNS) to alert relevant teams of anomalies or errors.

The workflow includes the following steps:

  1. We can load or push data to Amazon S3 either through AWS Command Line Interface (AWS CLI) commands, AWS access keys, an AWS transfer service (SFTP), or an API Gateway service.
  2. We use DataBrew to perform ETL jobs.
  3. These ETL jobs are either triggered (using Lambda functions) or scheduled.
  4. The processed dataset from DataBrew is imported to Amazon PinPoint as segments using trigger-based Lambda functions.
  5. Marketing teams use Amazon PinPoint to launch campaigns.
  6. We can also perform data profiling using DataBrew.
  7. Finally, we export Amazon PinPoint metrics data to Amazon S3 using Amazon Kinesis Data Firehose. We can use the data for further analysis using Amazon Athena and Amazon QuickSight.

The following diagram illustrates our solution architecture.

BDB 1712 image001 cropped

As a bonus step, we can create a simple web portal using AWS Amplify that makes API calls to Amazon PinPoint to launch campaigns in case we want to restrict users from launching campaigns using Amazon PinPoint from the AWS Management Console. This web portal can also provide basic metrics generated by Amazon PinPoint. You can also use it as a product or platform that anyone can use to launch campaigns.

To implement the solution, we complete the following steps:

  1. Create the source datasets using both batch ingestion and Amazon Kinesis streaming services.
  2. Build an automated data ingestion pipeline that transforms the source data and makes it campaign ready.
  3. Build a DataBrew data profile job, after which you can view profile metrics and alert teams in case of any anomalies in the source data.
  4. Launch a campaign using Amazon PinPoint.
  5. Export Amazon PinPoint project events to Amazon S3 using Kinesis Data Firehose.

Create the source datasets

In this step, we create the source datasets from both an SFTP server and Kinesis in Amazon S3.

First, we create an S3 bucket to store the source, processed, and campaign-ready data.

We can ingest data into AWS through many different methods. For this post, we consider methods for batch and streaming ingestion.

For batch ingestion, we create an SFTP-enabled server for source data ingestion. We can configure SFTP servers to store data on either Amazon S3 or Amazon Elastic File System (Amazon EFS). For this post, we configure an SFTP server to store data on Amazon S3.

We use API Gateway and Kinesis for stream ingestion. We can also push data to Kinesis directly, but using API Gateway is preferred because it’s easy to handle cross-account and authentication issues. For instructions on integrating API Gateway and a Kinesis stream, see Tutorial: Create a REST API as an Amazon Kinesis proxy in API Gateway.

Each dataset should be in its own S3 folder: s3://<bucket-name>/src-files/datasource1/, s3://<bucket-name>/src-files/datasource2/, and so on.

BDB 1712 image003

Build the automated data ingestion pipeline

In this step, we build the end-to-end automated data ingestion pipeline that transforms the source data and makes it campaign ready.

We use DataBrew for data preparation and data quality, and Lambda and Amazon SNS for automation and alerting. Amazon PinPoint can then use this campaign data to launch campaigns.

Our pipeline performs the following functions:

  1. Run transformations on source datasets.
  2. Merge the datasets to make the data campaign ready.
  3. Perform data quality checks.
  4. Alert relevant teams in case of anomalies in the data quality.
  5. Alert relevant teams in case of DataBrew job failures.
  6. Import the campaign-ready dataset as a segment in Amazon PinPoint.

We can either have one DataBrew project to perform necessary transformations on each dataset and merge all the datasets into one final dataset, or have one DataBrew project for each dataset and another project that merges all the transformed datasets. The advantage of having individual projects for each dataset is that we’re decoupling all data sources, so an issue in one data source doesn’t impact another. We use this latter approach in this post.

For instructions on building the DataBrew job, see Creating and working with AWS Glue DataBrew recipe jobs.

DataBrew provides more than 250 transformations. For this post, we add the following transformations to clean the source datasets:

  • Validate column values and add default values if missing
  • Convert column values to a standard format, such as standardize date values or convert lowercase to uppercase (Amazon PinPoint is case-sensitive)
  • Remove special characters if needed
  • Split values if needed
  • Check for duplicates
  • Add audit columns

You could also add a few data quality recipe steps to the recipe.
BDB 1712 image005

DataBrew jobs can be scheduled or trigger based (through a Lambda function). For this post, we configure jobs processing individual data sources to be trigger based, and the final job merging all datasets is scheduled. The advantage of having trigger-based DataBrew jobs is that it only triggers if you have a source file, which helps reduce costs.

We first configure an S3 event that triggers a Lambda function. The function triggers the DataBrew job based on the S3 key value. See the following code (Boto3) for our function:

import boto3 client = boto3.client('databrew') def lambda_handler(event, context):
record=event['Records'][0]
bucketname = record['s3']['bucket']['name']
key = record['s3']['object']['key']
print('key: '+key) if key.find('SourceFolder1') != -1:
print('processing source1 file')
response = client.start_job_run(Name='databrew-process-data-job1')
if key.find('SourceFolder2') != -1:
print('processing source2 file')
response = client.start_job_run(Name='databrew-process-data-job2')

The processed data appears in s3://<bucket-name>/src-files/process-data/data-source1/.

Next we create the job that merges the processed data files into a final, campaign-ready dataset. We can configure the job to merge only those files that have been dropped in the last 24 hours.

The campaign-ready data is located in s3://<bucket-name>/src-files/campaign-ready/. This dataset is now ready to serve as input to Amazon PinPoint.
BDB 1712 image009

Build a DataBrew data profile job

We can use DataBrew to run a data profile job on any of the datasets defined in the previous steps. When you profile your data, DataBrew creates a report called a data profile. This report displays statistics such as the number of rows in the sample and the distribution of unique values in each column. In this post, we use Lambda functions to read the report and detect anomalies and send alerts using Amazon SNS to respective teams for further action.

  1. On the DataBrew console, on the Datasets page, select your dataset.
  2. Choose Run data profile.
    BDB 1712 image011
  3. For Job name, enter a name for your job.
  4. For Data sample, select either Full dataset or Custom sample (for this post, we sample 20,000 rows).
    BDB 1712 image013
  5. In the Job output settings section, for S3 location, enter your output bucket.
  6. Optionally, select Enable encryption for job output file to encrypt your data.
  7. Configure optional settings, such as profile configurations; number of nodes, job timeouts, and retries; schedules; and tags.
  8. Choose an AWS Identity and Access Management (IAM) role for the profile job.
  9. Choose Create and run job.
    BDB 1712 image015

After you run the job, DataBrew provides you with job metrics. For example, the following screenshot shows the Dataset preview tab.
BDB 1712 image017

The following screenshot shows an example of the Data profile overview tab.
BDB 1712 image019

This tab also includes a summary of the column details.
BDB 1712 image021

The following screenshot shows an example of the Column statistics tab.
BDB 1712 image023

Next, we set up alerts using the profile output file.

  1. Configure an S3 event to trigger a Lambda function.

The function reads the output and checks for anomalies in the data. You define the anomalies; for example, when the total missing for a column is greater than 10. The function can then raise an SNS alert if it detects the anomaly.

Launch a campaign using Amazon PinPoint

Before you create segments, create an Amazon PinPoint project. To create segments, we use S3 events to trigger a Lambda function that creates a new base segment whenever the final DataBrew job loads the campaign-ready data. We can either create or update base segments; for this post, we create a new segment. See the following code (Boto3) for the Lambda function:

import boto3
import time
from datetime import datetime now = datetime.now() # current date and time date_time = now.strftime("%Y_%m_%d") segname = segment_name_'+date_time time.sleep(60) client = boto3.client('pinpoint') def lambda_handler(event, context): response = client.create_import_job(
ApplicationId='xxx-project-id-from-pinpoint',
ImportJobRequest={ 'DefineSegment': True, 'Format': 'CSV', 'RoleArn': 'arn:aws:iam::xxx-accountid:role/service-role/xxx_pinpoint_role', 'S3Url': 's3://<xxx-bucketname>/campaign-ready', 'SegmentName': segname
}
) print(response)

Use this base segment to create dynamic segments and launch a campaign. For more information, see Amazon PinPoint campaigns.

Export Amazon PinPoint project events to Amazon S3 using Kinesis Data Firehose

You can track and push events related to your project, such as sent, delivered, opened messages, and a few others, to either Amazon Kinesis Data Streams or Kinesis Data Firehose, which stream this data to AWS data stores such as Amazon S3. For this post, we use Kinesis Data Firehose. We create our delivery stream prior to enabling event streams on the Amazon PinPoint project.

  1. Create your Firehose delivery stream.

The event stream is disabled by default.

  1. Choose Edit.
  2. Select Stream to Amazon Kinesis and select Send events to an Amazon Kinesis Data Firehose delivery stream.
  3. Choose the delivery stream you created.
  4. For IAM role, you can allow DataBrew to automatically create a new role or use an existing role.
  5. Choose Save.

The events are now sent to Amazon S3. You could either create an Athena table or use Amazon Kinesis Data Analytics to analyze events and build a dashboard using QuickSight.

Security best practices

Consider the following best practices in order to mitigate security threats:

  • Be mindful while creating IAM roles to provide access to only necessary services
  • If sending emails through Amazon SNS, make sure you send email alerts only to verified or subscribed recipients to minimize the possibility of automated emails being used to target victims’ external email addresses
  • Use IAM roles rather than user keys
  • Have logs written to Amazon CloudWatch, and set CloudWatch alarms in case of failures
  • Take regular backups of DataBrew jobs and Amazon Pinpoint campaigns (if needed)
  • Restrict network access for inbound and outbound traffic to least privilege
  • Enable the lifecycle policy to retain only necessary data, and delete unnecessary data
  • Enable server-side encryption using AWS KMS (SSE-KMS) or Amazon S3 (SSE-S3)
  • Enable cross-Region replication of data in case you feel backing up the source data is necessary

Clean up

To avoid ongoing charges, clean up the resources you created as part of this post:

  • S3 bucket
  • SFTP server
  • DataBrew resources
  • PinPoint resources
  • Firehose delivery stream, if applicable
  • Athena tables and QuickSight dashboards, if applicable

Conclusion

In this post, we walked through how to implement an automated workflow using DataBrew to perform ETL, Amazon PinPoint to launch campaigns, and Lambda to automate the process. This solution helps reduce time to production, is easy to implement because it doesn’t require coding, and has no licensing costs involved. Try this solution today for your own datasets, and leave any comments or questions in the comments section.


About the Authors

sushivnSuraj Shivananda is a Solutions Architect at AWS. He has over a decade of experience in Software Engineering, Data and Analytics, DevOps specifically for data solutions, automating and optimizing cloud based solutions. He’s a trusted technical advisor and helps customers build Well Architected solutions on the AWS platform.

surbhi dangiSurbhi Dangi is a Sr Manager, Product Management at AWS. Her work includes building user experiences for Database, Analytics & AI AWS consoles, launching new database and analytics products, working on new feature launches for existing products, and building broadly adopted internal tools for AWS teams. She enjoys traveling to new destinations to discover new cultures, trying new cuisines, and teaches product management 101 to aspiring PMs.

Categories: Big Data