Proper data management is critical to successful, data-driven decision-making. An increasingly large number of customers are adopting data lakes to realize deeper insights from big data. As part of this, you need clean and trusted data in order to gain insights that lead to improvements in your business. As the saying goes, garbage in is garbage out—the analysis is only as good as the data that drives it.

Organizations today have continuously incoming data that may develop slight changes in schema, quality, or profile over a period of time. To ensure data is always of high quality, we need to consistently profile new data, evaluate that it meets our business rules, alert for problems in the data, and fix any issues. In this post, we leverage AWS Glue DataBrew, a visual data preparation tool that makes it easy to profile and prepare data for analytics and machine learning (ML). We demonstrate how to use DataBrew to publish data quality statistics and build a solution around it to automate data quality alerts.

Overview of solution

In this post, we walk through a solution that sets up a recurring profile job to determine data quality metrics and, using your defined business rules, report on the validity of the data. The following diagram illustrates the architecture.

We’ll walk through a solution that takes sets up a recurring Profile job to determine data quality metrics, and using your defined business rules.

The steps in this solution are as follows:

  1. Periodically send raw data to Amazon Simple Storage Service (Amazon S3) for storage.
  2. Read the raw data in Amazon S3 and generate a scheduled DataBrew profile job to determine data quality.
  3. Write the DataBrew profile job output to Amazon S3.
  4. Trigger an Amazon EventBridge event after job completion.
  5. Invoke an AWS Lambda function based on the event, which reads the profile output from Amazon S3 and determines whether the output meets data quality business rules.
  6. Publish the results to an Amazon Simple Notification Service (Amazon SNS) topic.
  7. Subscribe email addresses to the SNS topic to inform members of your organization.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deploying the solution

For a quick start of this solution, you can deploy the provided AWS CloudFormation stack. This creates all the required resources in your account (us-east-1 Region). Follow the rest of this post for a deeper dive into the resources.

  1. Choose Launch Stack:

LaunchStack

  1. In Parameters, for Email, enter an email address that can receive notifications.
  2. Scroll to the end of the form and select I acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Create stack.

It takes a few minutes for the stack creation to complete; you can follow progress on the Events tab.

  1. Check your email inbox and choose Confirm subscription in the email from AWS Notifications.

The default behavior of the deployed stack runs the profile on Sundays. You can start a one-time run from the DataBrew console to try out the end-to-end solution.

Setting up your source data in Amazon S3

In this post, we use an open dataset of New York City Taxi trip record data from The Registry of Open Data on AWS. This dataset represents a collection of CSV files defining trips taken by taxis and for-hire vehicles in New York City. Each record contains the pick-up and drop-off IDs and timestamps, distance, passenger count, tip amount, fair amount, and total amount. For the purpose of illustration, we use a static dataset; in a real-world use case, we would use a dataset that is refreshed at a defined interval.

You can download the sample dataset (us-east-1 Region) and follow the instructions for this solution, or use your own data that gets dumped into your data lake on a recurring basis. We recommend creating all your resources in the same account and Region. If you use the sample dataset, choose us-east-1.

Creating a DataBrew profile job

To get insights into the quality of our data, we run a DataBrew profile job on a recurring basis. This profile provides us with a statistical summary of our dataset, including value distributions, sparseness, cardinality, and type determination.

Connecting a DataBrew dataset

To connect your dataset, complete the following steps:

  1. On the DataBrew console, in the navigation pane, choose Datasets.
  2. Choose Connect new dataset.
  3. Enter a name for the dataset.
  4. For Enter your source from S3, enter the S3 path of your data source. In our case, this is s3://nyc-tlc/misc/.
  5. Select your dataset (for this post, we choose the medallions trips dataset FOIL_medallion_trips_june17.csv).

BDB 1190 2

  1. Scroll to the end of the form and choose Create dataset.

Creating the profile job

You’re now ready to create your profile job.

  1. In the navigation pane, choose Datasets.
  2. On the Datasets page, select the dataset that you created in the previous step. The row in the table should be highlighted.
  3. Choose Run data profile.
  4. Select Create profile job.
  5. For Job output settings, enter an S3 path as destination for the profile results. Make sure to note down the S3 bucket and key, because you use it later in this tutorial.
  6. For Permissions, choose a role that has access to your input and output S3 paths. For details on required permissions, see DataBrew permission documentation.
  7. On the Associate schedule drop-down menu, choose Create new schedule.
  8. For Schedule name, enter a name for the schedule.
  9. For Run frequency, choose a frequency based on the time and rate at which your data is refreshed.
  10. Choose Add.

BDB 1190 3

  1. Choose Create and run job.

The job run on sample data typically takes 2 minutes to complete.

Exploring the data profile

Now that we’ve run our profile job, we can expose insightful characteristics about our dataset. We can also review the results of the profile through the visualizations of the DataBrew console or by reading the raw JSON results in our S3 bucket.

The profile analyzes both at a dataset level and column level granularity. Looking at our column analytics for String columns, we have the following statistics:

  • MissingCount – The number of missing values in the dataset
  • UniqueCount – The number of unique values in the dataset
  • Datatype – The data type of the column
  • CommonValues – The top 100 most common strings and their occurrences
  • Min – The length of the shortest String value
  • Max – The length of the longest String value
  • Mean – The average length of the values
  • Median – The middle value in terms of character count
  • Mode – The most common String value length
  • StandardDeviation – The standard deviation for the lengths of the String valuesBDB 1190 4 1

For numerical columns, we have the following:

  • Min – The minimum value
  • FifthPercentile – The value that represents 5th percentile (5% of values fall below this and 95% fall above)
  • Q1 – The value that represents 25th percentile (25% of values fall below this and 75% fall above)
  • Median – The value that represents 50th percentile (50% of values fall below this and 50% fall above)
  • Q3 – The value that represents 75th percentile (75% of values fall below this and 25% fall above)
  • NinetyFifthPercentile – The value that represents 95th percentile (95% of values fall below this and 5% fall above)
  • Max – The highest value
  • Range – The difference between the highest and lowest values
  • InterquartileRange – The range between the 25th percentile and 75th percentile values
  • StandardDeviation – The standard deviation of the values (measures the variation of values)
  • Kurtosis – The kurtosis of the values (measures the heaviness of the tails in the distribution)
  • Skewness – The skewness of the values (measures symmetry in the distribution)
  • Sum – The sum of the values
  • Mean – The average of the values
  • Variance – The variance of the values (measures divergence from the mean)
  • CommonValues – A list of the most common values in the column and their occurrence count
  • MinimumValues – A list of the 5 minimum values in the list and their occurrence count
  • MaximumValues – A list of the 5 maximum values in the list and their occurrence count
  • MissingCount – The number of missing values
  • UniqueCount – The number of unique values
  • ZerosCount – The number of zeros
  • Datatype – The datatype of the column
  • Min – The minimum value
  • Max – The maximum value
  • Median – The middle value
  • Mean – The average value
  • Mode – The most common value  BDB 1190 5 1

Finally, at a dataset level, we have an overview of the profile as well as cross-column analytics:

  • DatasetName – The name of the dataset the profile was run on
  • Size – The size of the data source in KB
  • Source – The source of the dataset (for example, Amazon S3)
  • Location – The location of the data source
  • CreatedBy – The ARN of the user that created the profile job
  • SampleSize – The number of rows used in the profile
  • MissingCount – The total number of missing cells
  • DuplicateRowCount – The number of duplicate rows in the dataset
  • StringColumnsCount – The number of columns that are of String type
  • NumberColumnsCount – The number of columns that are of numeric type
  • BooleanColumnsCount – The number of columns that are of Boolean type
  • MissingWarningCount – The number of warnings on columns due to missing values
  • DuplicateWarningCount – The number of warnings on columns due to duplicate values
  • JobStarted – A timestamp indicating when the job started
  • JobEnded – A timestamp indicating when the job ended
  • Correlations – The statistical relationship between columns

BDB 1190 6 1

By default, the DataBrew profile is run on a 20,000-row First-N sample of your dataset. If you want to increase the limit and run the profile on your entire dataset, send a request to [email protected].

Creating an SNS topic and subscription

Amazon SNS allows us to deliver messages regarding the quality of our data reliably and at scale. For this post, we create an SNS topic and subscription. The topic provides us with a central communication channel that we can broadcast to when the job completes, and the subscription is then used to receive the messages published to our topic. For our solution, we use an email protocol in the subscription in order to send the profile results to the stakeholders in our organization.

Creating the SNS topic

To create your topic, complete the following steps:

  1. On the Amazon SNS console, in the navigation pane, choose Topics.
  2. Choose Create topic.
  3. For Type, select Standard.
  4. For Name, enter a name for the topic.

For Name, enter a name for the topic.

  1. Choose Create topic.
  2. Take note of the ARN in the topic details to use later.

Creating the SNS subscription

To create your subscription, complete the following steps:

  1. In the navigation pane, choose Subscriptions.
  2. Choose Create subscription.
  3. For Topic ARN, choose the topic that you created in the previous step.
  4. For Protocol, choose Email.
  5. For Endpoint, enter an email address that can receive notifications.

For Endpoint, enter an email address that can receive notifications.

  1. Choose Create subscription.
  2. Check your email inbox and choose Confirm subscription in the email from AWS Notifications.

Creating a Lambda function for business rule validation

The profile has provided us with an understanding of the characteristics of our data. Now we can create business rules that ensure we’re consistently monitoring the quality our data.

For our sample taxi dataset, we will validate the following:

  • Making sure the pu_loc_id and do_loc_id columns meet a completeness rate of 90%.
  • If more than 10% of the data in those columns is missing, we’ll notify our team that the data needs to be reviewed.

Creating the Lambda function

To create your function, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose Create function.
  3. For Function name¸ enter a name for the function.
  4. For Runtime, choose the language you want to write the function in. If you want to use the code sample provided in this tutorial, choose Python 3.8.

For Runtime, choose the language you want to write the function in. If you want to use the code sample provided in this tutorial, choose Python 3.8.

  1. Choose Create function.

Adding a destination to the Lambda function

You now add a destination to your function.

  1. On the Designer page, choose Add destination.
  2. For Condition, select On success.
  3. For Destination type, choose SNS topic.
  4. For Destination, choose the SNS topic from the previous step.

For Destination, choose the SNS topic from the previous step.

  1. Choose Save.

Authoring the Lambda function

For the function code, enter the following sample code or author your own function that parses the DataBrew profile job JSON and verifies it meets your organization’s business rules.

If you use the sample code, make sure to fill in the values of the required parameters to match your configuration:

  • topicArn – The resource identifier for the SNS topic. You find this on the Amazon SNS console’s topic details page (for example, topicArn = 'arn:aws:sns:us-east-1:012345678901:databrew-profile-topic').
  • profileOutputBucket – The S3 bucket the profile job is set to output to. You can find this on the DataBrew console’s job details page (for example, profileOutputBucket = 'taxi-data').
  • profileOutputPathKey – The S3 key the profile job is set to output to. You can find this on the DataBrew console’s job details page (for example, profileOutputPathKey = profile-out/'). If you’re writing directly to an S3 bucket, keep this as an empty String (profileOutputPathKey = '').
    import json
    import boto3 sns = boto3.client('sns')
    s3 = boto3.client('s3')
    s3Resource = boto3.resource('s3') # === required parameters ===
    topicArn = 'arn:aws:sns:<YOUR REGION>:<YOUR ACCOUNT ID>:<YOUR TOPIC NAME>'
    profileOutputBucket = '<YOUR S3 BUCKET NAME>'
    profileOutputPrefix = '<YOUR S3 KEY>' def verify_completeness_rule(bucket, key): # completeness threshold set to 10% threshold = 0.1 # parse the DataBrew profile profileObject = s3.get_object(Bucket = bucket, Key = key) profileContent = json.loads(profileObject['Body'].read().decode('utf-8')) # verify the completeness rule is met on the pu_loc_id and do_loc_id columns for column in profileContent['columns']: if (column['name'] == 'pu_loc_id' or column['name'] == 'do_loc_id'): if ((column['missingValuesCount'] / profileContent['sampleSize']) > threshold): # failed the completeness check return False # passed the completeness check return True def lambda_handler(event, context): jobRunState = event['detail']['state'] jobName = event['detail']['jobName'] jobRunId = event['detail']['jobRunId'] profileOutputKey = '' if (jobRunState == 'SUCCEEDED'): profileOutputPostfix = jobRunId[3:] + '.json' bucket = s3Resource.Bucket(profileOutputBucket) for object in bucket.objects.filter(Prefix = profileOutputPrefix): if (profileOutputPostfix in object.key): profileOutputKey = object.key if (verify_completeness_rule(profileOutputBucket, profileOutputKey)): message = 'Nice! Your profile job ' + jobName + ' met business rules. Head to https://console.aws.amazon.com/databrew/ to view your profile.' subject = 'Profile job ' + jobName + ' met business rules' else: message = 'Uh oh! Your profile job ' + jobName + ' did not meet business rules. Head to https://console.aws.amazon.com/databrew to clean your data.' subject = 'Profile job ' + jobName + ' did not meet business rules' else: # State is FAILED, STOPPED, or TIMEOUT - intervention required message = 'Uh oh! Your profile job ' + jobName + ' is in state ' + jobRunState + '. Check the job details at https://console.aws.amazon.com/databrew#job-details?job=' + jobName subject = 'Profile job ' + jobName + ' in state ' + jobRunState response = sns.publish ( TargetArn = topicArn, Message = message, Subject = subject ) return { 'statusCode': 200, 'body': json.dumps(response) }

Updating the Lambda function’s permissions

In this final step of configuring your Lambda function, you update your function’s permissions.

  1. In the Lambda function editor, choose the Permissions tab.
  2. For Execution role, choose the role name to navigate to the AWS Identity and Access Management (IAM) console.
  3. In the Role summary, choose Add inline policy.
  4. For Service, choose S3.
  5. For Actions, under List, choose ListBucket.
  6. For Actions, under Read, choose Get Object.
  7. In the Resources section, for bucket, choose Add ARN.
  8. Enter the bucket name you used for your output data in the create profile job step.
  9. In the modal, choose Add.
  10. For object, choose Add ARN.
  11. For bucket name, enter the bucket name you used for your output data in the create profile job step and append the key (for example, taxi-data/profile-out).
  12. For object name, choose Any. This provides read access to all objects in the chosen path.
  13. In the modal, choose Add.
  14. Choose Review policy.
  15. On the Review policy page, enter a name.
  16. Choose Create policy. 

We return to the Lambda function to add a trigger later, so keep the Lambda service page open in a tab as you continue to the next step, adding an EventBridge rule.

Creating an EventBridge rule for job run completion

EventBridge is a serverless event bus service that we can configure to connect applications. For this post, we configure an EventBridge rule to route DataBrew job completion events to our Lambda function. When our profile job is complete, the event triggers the function to process the results.

Creating the EventBridge rule

To create our rule in EventBridge, complete the following steps:

  1. On the EventBridge console, in the navigation pane, choose Rules.
  2. Choose Create rule.
  3. Enter a name and description for the rule.
  4. In the Define pattern section, select Event pattern.
  5. For Event matching pattern, select Pre-defined pattern by service.
  6. For Service provider, choose AWS.
  7. For Service name, choose AWS Glue DataBrew.
  8. For Event type, choose DataBrew Job State Change.
  9. For Target, choose Lambda function.
  10. For Function, choose the name of the Lambda function you created in the previous step.

For Function, choose the name of the Lambda function you created in the previous step.

  1. Choose Create.

Adding the EventBridge rule as the Lambda function trigger

To add your rule as the function trigger, complete the following steps:

  1. Navigate back to your Lambda function configuration page from the previous step.
  2. In the Designer, choose Add trigger.
  3. For Trigger configuration, choose EventBridge (CloudWatch Events).
  4. For Rule, choose the EventBridge rule you created in the previous step.

For Rule, choose the EventBridge rule you created in the previous step.

  1. Choose Add.

Testing your system

That’s it! We’ve completed all the steps required for this solution to run periodically. To give it an end-to-end test, we can run our profile job once and wait for the resulting email to get our results.

  1. On the DataBrew console, in the navigation pane, choose Jobs.
  2. On the Profile jobs tab, select the job that you created. The row in the table should be highlighted.
  3. Choose Run job.
  4. In the Run job modal, choose Run job.

A few minutes after the job is complete, you should receive an email notifying you of the results of your business rule validation logic.

A few minutes after the job is complete, you should receive an email notifying you of the results of your business rule validation logic.

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

Conclusion

In this post, we walked through how to use DataBrew alongside Amazon S3, Lambda, EventBridge, and Amazon SNS to automatically send data quality alerts. We encourage you to extend this solution by customizing the business rule validation to meet your unique business needs.


About the Authors

Romi BoimerRomi Boimer is a Sr. Software Development Engineer at AWS and a technical lead for AWS Glue DataBrew. She designs and builds solutions that enable customers to efficiently prepare and manage their data. Romi has a passion for aerial arts, in her spare time she enjoys fighting gravity and hanging from fabric.

 

 

Shilpa MohanShilpa Mohan is a Sr. UX designer at AWS and leads the design of AWS Glue DataBrew. With over 13 years of experience across multiple enterprise domains, she is currently crafting products for Database, Analytics and AI services for AWS. Shilpa is a passionate creator, she spends her time creating anything from content, photographs to crafts.