This is a guest post by Pradnya Bhandary, Co-Founder and CEO at Bryte Systems.
Data can be transformative for an organization. How and where you store your data for analysis and business intelligence is therefore an especially important decision that each organization needs to make. Should you choose an on-premises data warehouse solution or embrace the cloud?
On-premises data warehouses require servers in your data center, a team to manage them, and a large initial investment sized to accommodate both current and future data needs. Cloud data warehouses, on the other hand, are fully managed and can start with a small investment and grow as your business demands. You need to provision only for present needs, not for the future, with the peace of mind that extra capacity can be added overnight if needed.
One such cloud data warehouse is Amazon Redshift, the most popular cloud data warehouse. It’s a fully managed, petabyte-scale data warehouse designed for storing and analyzing large datasets. It provides agility, flexibility, and cost-effectiveness.
To use Amazon Redshift effectively, getting your data efficiently from various data silos to the data warehouse is critical, because this determines how quickly you can access the data. Do you have to wait until the data is loaded and make do with stale data, or can you access your data across the organization in near-real time to derive fresh and rich data insights?
In this post, we explain how Origin Energy followed some of the best practices for data ingestion for Amazon Redshift and how they achieved faster ingestion rates into Amazon Redshift using the BryteFlow software.
Origin Energy, a leading energy provider, was finding that its on-premises data warehouse was struggling to support growing data demands. They also needed to unlock siloed data from legacy databases like SAP, Oracle, SQL Server, MySQL, and more. Access to data was fragmented and time consuming.
Moving to a cloud enterprise analytics environment with centralized data access was the only viable option to support their data initiatives. For more information, see the Origin Energy case study.
To speed up data ingestion on Amazon Redshift, they followed data ingestion best practices.
Log-based CDC mechanism to get data to Amazon Redshift
When data is replicated from a source database to a target, which could be another database, a data warehouse, or cloud object storage, changes to the data in the source need to be captured and replicated to the destination to keep data consistent and trustworthy. Change data capture (CDC) makes this possible. CDC captures the changes in the source data and updates only the data in the destination that has changed. This does away with the tedious task of bulk load updating and enables real-time data integration. Because log-based CDC reads changes from the database transaction logs rather than querying the source tables, it’s a continual, extremely reliable process that places minimal load on source systems.
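The idea of applying only the changed rows can be sketched in a few lines of Python. This is a minimal illustration, not BryteFlow's implementation: the `ChangeEvent` shape and the `apply_changes` function are hypothetical names, and a plain dictionary stands in for the target table.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One change read from the source transaction log (hypothetical shape)."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new column values; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> int:
    """Apply events in log order to the target; return how many were applied."""
    applied = 0
    for event in events:
        if event.op == "delete":
            target.pop(event.key, None)
        elif event.op in ("insert", "update"):
            if event.row is None:
                raise ValueError(f"{event.op} event for key {event.key} has no row")
            # Upsert: only the changed row is touched, not the whole table.
            target[event.key] = dict(event.row)
        else:
            raise ValueError(f"unknown operation: {event.op}")
        applied += 1
    return applied
```

Rows that did not change are never read or rewritten, which is what distinguishes this approach from a periodic bulk reload.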
Parallel multi-threaded initial sync
When doing an initial ingest of data to Amazon Redshift, especially of exceptionally large datasets, parallel, multi-threaded syncing is extremely helpful in cutting down total data ingestion time. Data replication proceeds in parallel on multiple threads with optimized extraction and loading.
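One common way to parallelize an initial load is to split the source table's key range into chunks and hand each chunk to its own thread. The following Python sketch shows that pattern under stated assumptions: `extract_chunk` and `load_chunk` are hypothetical callbacks supplied by the caller, and the table is assumed to have a numeric key. It is not a description of how BryteFlow partitions work internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def split_key_range(min_key: int, max_key: int, parts: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_key, max_key] into up to `parts` contiguous chunks."""
    if parts < 1 or max_key < min_key:
        raise ValueError("need parts >= 1 and max_key >= min_key")
    total = max_key - min_key + 1
    parts = min(parts, total)
    base, extra = divmod(total, parts)
    chunks, start = [], min_key
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        chunks.append((start, start + size - 1))
        start += size
    return chunks


def parallel_initial_sync(extract_chunk: Callable[[int, int], Sequence[dict]],
                          load_chunk: Callable[[Sequence[dict]], None],
                          min_key: int, max_key: int, threads: int = 4) -> int:
    """Extract and load each key-range chunk on its own thread; return rows copied."""
    def sync_one(bounds: Tuple[int, int]) -> int:
        low, high = bounds
        rows = extract_chunk(low, high)
        load_chunk(rows)
        return len(rows)

    chunks = split_key_range(min_key, max_key, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(sync_one, chunks))
```

Threads suit this workload because extraction and loading are dominated by waiting on the network and the databases rather than by local computation.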
Parallel multi-threaded log-based capture and merge for Oracle
When source systems generate a large amount of incremental data and transaction logs are written very often, the CDC mechanism (mining logs for incremental data) can lag behind if it can’t keep up with the source throughput. You can configure BryteFlow to mine logs on multiple parallel threads. You can also configure it to mine the logs on a completely different server: make the transaction logs available on a shared mount on a remote server you spin up, and mine the logs there. This puts zero load on the source and operational systems, even when mining huge volumes of incremental data, and the parallelism makes data replication much faster.
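Mining several log files at once only helps if the changes are put back into commit order before they are applied. The sketch below illustrates that two-step pattern in Python. It assumes each change carries a monotonically increasing sequence number; `mine_file` is a hypothetical callback that parses one log file, and nothing here reflects Oracle's log format or BryteFlow's internals.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Change = Tuple[int, str]  # (sequence number, payload); hypothetical shape


def mine_logs_in_parallel(log_files: Sequence[str],
                          mine_file: Callable[[str], List[Change]],
                          threads: int = 4) -> List[Change]:
    """Mine each log file on a worker thread, then merge changes by sequence number."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Sort each file's output so the merge below sees ordered inputs.
        per_file = [sorted(changes, key=lambda c: c[0])
                    for changes in pool.map(mine_file, log_files)]
    # k-way merge restores a single stream in sequence order.
    return list(heapq.merge(*per_file, key=lambda c: c[0]))
```

Running this on a separate server that reads the logs from a shared mount keeps both the parsing and the merging off the source database.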
Best practices for loading data to Amazon Redshift
The following are the three best practices for loading data into Amazon Redshift.
Split large files into multiple files for high-performance loads
Amazon Redshift is a massively parallel processing (MPP) data warehouse, where several compute nodes work in parallel to ingest the data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. The following table shows the different node types and the number of slices.
|Node Type||Default Slices per Node|
|Dense Compute DC2|||
|Dense Storage DS2|||
After you extract data into files, you can compress the files and split a single file into multiple files according to the number of slices, so that the load is distributed evenly across the slices on Amazon Redshift. The number of files is a configurable parameter in BryteFlow that can be set depending on the Amazon Redshift node types. The COPY command that ingests data into Amazon Redshift is configured optimally for fast data loads.
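As a rough illustration of the split-and-compress step, the following Python sketch writes rows round-robin into a chosen number of gzip files and builds a COPY statement that loads every file sharing a key prefix. The function names, file naming scheme, bucket, and role ARN are all hypothetical, and the COPY options shown (`GZIP`, `DELIMITER`) are just one plausible combination, not the configuration BryteFlow generates. A file count that is a multiple of the cluster's slice count is the usual target.

```python
import gzip
import os
import tempfile
from typing import List, Sequence


def split_and_compress(lines: Sequence[str], out_dir: str,
                       prefix: str, file_count: int) -> List[str]:
    """Write `lines` round-robin into `file_count` gzip files; return their paths."""
    if file_count < 1:
        raise ValueError("file_count must be >= 1")
    paths = [os.path.join(out_dir, f"{prefix}.part_{i:03d}.gz")
             for i in range(file_count)]
    handles = [gzip.open(path, "wt", encoding="utf-8") for path in paths]
    try:
        for i, line in enumerate(lines):
            handles[i % file_count].write(line.rstrip("\n") + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


def build_copy_command(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a COPY statement that loads all gzip files under one S3 key prefix."""
    return (f"COPY {table} FROM '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' GZIP DELIMITER ',';")


if __name__ == "__main__":
    rows = [f"{i},value_{i}" for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        # Example: 4 files for a hypothetical cluster with 4 slices in total.
        print(split_and_compress(rows, tmp, "orders", 4))
    print(build_copy_command("orders", "s3://example-bucket/orders.part_",
                             "arn:aws:iam::123456789012:role/ExampleRole"))
```

Round-robin assignment keeps the files close to equal in size, which matters because the load finishes only when the slice with the largest file is done.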
Automatic creation of tables, default distribution keys, and distribution style
Data is distributed among the nodes on the basis of distribution style and distribution key of a particular table in Amazon Redshift. An even distribution of data enables Amazon Redshift to assign the workload evenly to slices and maximizes the benefit of parallel processing. This also helps during data ingestion. When ingesting data, BryteFlow automatically creates tables with the right DDL on Amazon Redshift. It also creates default distribution keys and distribution style, so that table ingestion is highly performant.
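To see why the choice of distribution key matters, it helps to estimate how evenly a candidate column would spread rows across slices. The Python sketch below does this with an MD5-based hash as a stand-in; Amazon Redshift's actual hashing is internal and not reproduced here. The function names are hypothetical, and `build_create_table` only shows what a key-distributed table definition looks like, not the DDL that BryteFlow emits.

```python
import hashlib
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def slice_for(value: Hashable, slice_count: int) -> int:
    """Map a key value to a slice with a stable hash (illustrative only)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % slice_count


def rows_per_slice(values: Iterable[Hashable], slice_count: int) -> List[int]:
    """Count how many rows each slice would receive for this candidate key."""
    counts = Counter(slice_for(v, slice_count) for v in values)
    return [counts.get(i, 0) for i in range(slice_count)]


def skew_ratio(values: Iterable[Hashable], slice_count: int) -> float:
    """Largest slice divided by smallest slice; 1.0 is perfectly even."""
    per_slice = rows_per_slice(values, slice_count)
    smallest = min(per_slice)
    return float("inf") if smallest == 0 else max(per_slice) / smallest


def build_create_table(table: str, columns: Dict[str, str], dist_key: str) -> str:
    """Build a CREATE TABLE statement that distributes rows by `dist_key`."""
    if dist_key not in columns:
        raise ValueError(f"{dist_key} is not a column of {table}")
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE {table} ({column_sql}) DISTSTYLE KEY DISTKEY ({dist_key});"
```

A high-cardinality column such as a customer ID typically yields a ratio close to 1.0, while a column with only a few distinct values leaves some slices idle.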
Optimum sort keys for efficient loads and columnar storage
Data in the Amazon Redshift data warehouse is stored in a columnar fashion, which drastically reduces the number of disk I/O requests and minimizes the amount of data loaded into memory to run a query. Reduction in I/O speeds up queries, and loading less data means Amazon Redshift can perform more in-memory processing. Using the optimum table sort keys is a best practice for efficient loads.
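The I/O benefit of a sort key can be illustrated with a toy model: if each storage block records the minimum and maximum value it holds, a range query can skip every block whose range doesn't overlap the predicate. The Python sketch below is a simplified model with hypothetical function names and tiny block sizes, not a description of Amazon Redshift's storage engine.

```python
import random
from typing import List, Sequence, Tuple


def build_block_ranges(values: Sequence[int], block_size: int) -> List[Tuple[int, int]]:
    """Return the (min, max) of each fixed-size block, in storage order."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def blocks_to_read(block_ranges: Sequence[Tuple[int, int]], low: int, high: int) -> int:
    """Count blocks whose (min, max) range overlaps the query range [low, high]."""
    return sum(1 for block_min, block_max in block_ranges
               if block_max >= low and block_min <= high)


if __name__ == "__main__":
    sorted_values = list(range(1000))
    shuffled_values = sorted_values[:]
    random.Random(0).shuffle(shuffled_values)
    for label, data in (("sorted", sorted_values), ("unsorted", shuffled_values)):
        ranges = build_block_ranges(data, block_size=100)
        print(label, blocks_to_read(ranges, 250, 349))
```

With sorted data, the query range maps to a small run of adjacent blocks; with unsorted data, almost every block contains at least one matching value and has to be read.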
Automatic sync on Amazon Redshift
CDC is an important element of syncing data with Amazon Redshift. BryteFlow automatically merges changes on Amazon Redshift with high performance and, if configured, maintains type2 history. This means that data is ready to use as soon as it is ingested, without running lengthy ETL processes on it. The following diagram illustrates the type2 history feature.
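For readers unfamiliar with type 2 history, the following Python sketch shows the general technique: instead of overwriting a row, the merge closes the current version and appends a new one, so every past state stays queryable. The `VersionedRow` fields (`effective_from`, `effective_to`) and the `merge_type2` function are hypothetical names chosen for illustration; the columns BryteFlow actually maintains may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class VersionedRow:
    key: int
    values: Dict[str, Any]
    effective_from: datetime
    effective_to: Optional[datetime] = None  # None marks the current version


def merge_type2(history: List[VersionedRow], key: int,
                new_values: Optional[Dict[str, Any]],
                changed_at: datetime) -> None:
    """Merge one change into the history; pass new_values=None for a delete."""
    current = next((row for row in history
                    if row.key == key and row.effective_to is None), None)
    if current is not None:
        if new_values is not None and current.values == new_values:
            return  # nothing changed, keep the current version open
        current.effective_to = changed_at  # close the old version
    if new_values is not None:
        # Append the new version as the current one.
        history.append(VersionedRow(key, dict(new_values), changed_at))
```

Querying the rows where `effective_to` is empty gives the current state, while filtering on a timestamp between the two columns reconstructs the state at any earlier point.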
Metadata for every extract and load can be captured on Amazon Aurora
Details on each extract and load process (for example, table names, the number of records affected, and the start and end times) are critical operational metadata that’s very useful in determining performance, tuning, and triggering other ETL processes. If this operational metadata is maintained on Amazon Redshift with every extract and load, constant single-row inserts and updates can hamper performance drastically because Amazon Redshift is a columnar database and not an OLTP system.
At BryteFlow, we found that the best practice is to keep the operational metadata in an Amazon Aurora database, which is OLTP in nature and can store this metadata with constant updates and inserts and low latency.
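The access pattern in question is one small insert when a load starts and one small update when it finishes, which is exactly what a row-oriented OLTP database handles well. The Python sketch below illustrates it using SQLite from the standard library purely as a local stand-in for an Aurora database. The `load_runs` table, its columns, and the function names are hypothetical, not BryteFlow's schema.

```python
import sqlite3
from datetime import datetime, timezone


def open_metadata_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the metadata database and create the run-tracking table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS load_runs (
               run_id           INTEGER PRIMARY KEY AUTOINCREMENT,
               table_name       TEXT NOT NULL,
               started_at       TEXT NOT NULL,
               finished_at      TEXT,
               records_affected INTEGER,
               status           TEXT NOT NULL
           )"""
    )
    return conn


def start_run(conn: sqlite3.Connection, table_name: str) -> int:
    """Record the start of an extract-and-load run; return its run_id."""
    now = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(
        "INSERT INTO load_runs (table_name, started_at, status) VALUES (?, ?, 'running')",
        (table_name, now),
    )
    conn.commit()
    return cursor.lastrowid


def finish_run(conn: sqlite3.Connection, run_id: int, records_affected: int) -> None:
    """Mark a run as complete with a single-row update."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE load_runs SET finished_at = ?, records_affected = ?, status = 'done' "
        "WHERE run_id = ?",
        (now, records_affected, run_id),
    )
    conn.commit()
```

Downstream ETL jobs can poll this table for runs with a `done` status instead of querying the warehouse itself.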
Build your data lake on Amazon S3 and automatically query using Amazon Redshift Spectrum
BryteFlow enables you to build a continually refreshing data lake at scale on Amazon Simple Storage Service (Amazon S3) with continual replication and transformation of data. You can configure BryteFlow to use Amazon EMR on the incremental data with data on Amazon S3 and automatically merge and transform your data with an intuitive GUI. The EMR cluster can scale up or down depending on your data needs. You can then query data automatically on Amazon S3 using Amazon Redshift Spectrum.
Offload data ingestion and data preparation on Amazon S3 and load to Amazon Redshift
BryteFlow Blend helps with real-time data preparation of data ingested by BryteFlow, using Apache Spark on Amazon EMR with an intuitive GUI. You can load data prepared on Amazon S3 to Amazon Redshift via BryteFlow Blend or make it accessible to Amazon Redshift via Amazon Redshift Spectrum. This helps reserve the computational resources of Amazon Redshift for the actual querying (queries run much faster) while the Amazon S3 data lake handles data integration. The following diagram illustrates the distributed data integration architecture.
BryteFlow software supports all of the preceding ingestion best practices. Origin Energy used the BryteFlow software to build their analytics platform on Amazon Redshift. Origin’s data access has improved from several days to mere hours. BryteFlow software has helped achieve accurate data replication on the AWS Cloud with low latency, facilitating faster time-to-market for new and highly personalized customer offerings and a significant reduction in data costs.
Amazon Redshift delivers fast performance at scale for the most demanding workloads. Ingesting and preparing your data for Amazon Redshift using the BryteFlow software makes this an extremely attractive value proposition. You can liberate your data across data silos and quickly unlock the value on Amazon Redshift.
To see how this works for your project, you can get a free trial from our website. We offer complete support on your free trial including screen sharing, online support, and consultation.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
About the Author
Pradnya Bhandary is the Co-Founder and CEO at Bryte Systems.