Many businesses have an essential need for structured data stored in their own database for business operations and offerings. For example, a company that produces electronics may want to store a structured dataset of parts. This requires the following properties: color, weight, connector type, and more.
This data may already be available from external sources. In many cases, one source is sufficient. But often, multiple data sources from different vendors must be incorporated. Each data source might have a different structure for the same data field, which is problematic. Achieving one unified structure from variable sources can be difficult, and is a classic data mining problem.
We will break the problem into two main challenges:
- Locate and collect data. Collect from multiple data sources and load data into a data store.
- Unify the collected data. Since the collected data has no constraints, it might be stored in different structures and file formats. To use the collected data, it must be unified by performing an extract, transform, load (ETL) process. This matches the different data sources and creates one unified data store.
In this post, we demonstrate a pipeline of services, built on top of a serverless architecture that will handle the preceding challenges. This architecture supports large-scale datasets. Because it is a serverless solution, it is also secure and cost effective.
We use Amazon SageMaker Ground Truth as a tool for classifying the data, so that no custom code is needed to classify different data sources.
Data mining and structuring
There are three main steps to explore in order to solve these challenges:
- Collect the data – Data mine from different sources
- Map the data – Construct a dictionary of key-value pairs without writing code
- Structure the collected data – Enrich your dataset with a unified collection of data that was collected and mapped in steps 1 and 2
Following is an example of a use case and solution flow using this architecture:
- In this scenario, a company must enrich an empty data base with items and properties, see Figure 1.
- Data will then be collected from multiple data sources, and stored in the cloud, as shown in Figure 2.
- To unify different property names, SageMaker Ground Truth is used to label the property names with a list of properties. The results are stored in Amazon DynamoDB, shown in Figure 3.
- Finally, the database is populated and enriched by the mapped properties from the different data sources. This can be iterated with new sources to further enrich the data base, see Figure 4.
1. Collect the data
Using this serverless architecture illustrated in Figure 5, your teams can minimize the effort and cost. You’ll be able to handle large-scale datasets to collect and store the data required for your business.
We use Amazon S3 as it is a highly scalable and durable object storage service, and can store the original dataset. It will initiate an event that will invoke a Lambda function to start a state machine, using the original dataset as its input.
AWS Step Functions are used to orchestrate the process of preparing the dataset for parallel scraping of the items. It will automatically manage the queue of items to be processed when the dataset is large. Step Functions ensures visibility of the process, reports errors, and decouples the compute-intensive scraping operation per item.
The state machine has two steps:
- ETL the data to clean and standardize it. Store each item in Amazon DynamoDB, a fast and flexible NoSQL database service for any scale. The ETL function will create an array of all the items identifiers. The identifier is a unique describer of the item, such as manufacturer ID and SKU.
- Using the Map functionality of Step Functions, a Lambda function will be invoked for each item. This runs all your scrapers for that item and stores the results in an S3 bucket.
This solution requires custom implementation of only these two functions, according to your own dataset and scraping sources. The ETL Lambda function will contain logic needed to transform your input into an array of identifiers. The scraper Lambda function will contain logic to locate the data in the source and then store it.
Scraper function flow
For each data source, write your own scraper. The Lambda function can run them sequentially.
- Use the identifier input to locate the item in each one of the external sources. The data source can be an API, a webpage, a PDF file, or other source.
- API: Collecting this data will be specific to the interface provided.
- Webpages: Data is collected with custom code. There are open source libraries that are popular for this task, such as Beautiful Soup.
- PDF files: Consider using Amazon Textract. Amazon Textract will give you key-value pairs and table analysis.
- Transform the response to key-value pairs as part of the scraper logic.
- Store the key-value pairs in a sub folder of the scraper responses S3 bucket, and name it after that data source.
2. Mapping the responses
This pipeline is initiated after the data is collected. It creates a labeling job of Named Entity Recognition, with a pre-defined set of labels. The labeling work will be split among your Workforces. When the job is completed, the output manifest file for named entity recognition is used for the final ETL Lambda. This manually locates the labeling key and values detected by your workforce, and places the results in a reusable mapping table in DynamoDB.
Amazon SageMaker Ground Truth is a fully managed data labeling service that helps you build highly accurate training datasets for machine learning (ML). By using Ground Truth, your teams can unify different data sources to match each other, so they can be identified and used in your applications.
3. Structure the collected data
Using another Lambda function (see in Figure 8, populate items properties), we use the collected data (1), and the mapping (2), to populate the unified dataset into the original data DynamoDB table (3).
In this blog, we showed a solution to automatically collect and structure data. We used a serverless architecture that requires minimal effort, to build a reusable asset that can unify different property definitions from different data sources. Minimal effort is involved in structuring this data, as we use Amazon SageMaker Ground Truth to match and reconcile the new data sources.
For further reading:
- Step Functions Support for Dynamic Parallelism
- Another Serverless Architecture for a Web Scraping Solution
- Getting started with Amazon SageMaker Ground Truth