By Ivan Bishop, Partner Solutions Architect, ISV Migrations at AWS
By Jon Roberts, Principal Engineer at Greenplum (Pivotal)
Are you thinking of deploying Pivotal Greenplum on the Amazon Web Services (AWS) Cloud? Many customers are, and once they consider the alternatives, many also want to shift responsibility for the underlying infrastructure to AWS.
With an on-premises deployment, it takes time to provision floor space in a data center, run power cables and fiber, and ensure adequate cooling. Then, you have to acquire the hardware, provision IP addresses, install and harden the operating system (OS) across multiple machines, and finally address monitoring and security. Only then can you install and configure Pivotal Greenplum on your on-premises infrastructure.
With Pivotal Greenplum on AWS, deployments are completely automated and complete in less than an hour. In fact, the barrier to entry is low enough that business units may deploy production-ready clusters themselves, without IT involvement.
Pivotal Greenplum is a commercial fully-featured Massively Parallel Processing (MPP) data warehouse platform powered by the open source Greenplum Database. It provides powerful and rapid analytics on petabyte scale data volumes, and is available on AWS Marketplace.
Pivotal and AWS have worked together to make deployment and ongoing operations of Pivotal Greenplum easy and painless. Speed, ease of management, and security are some of the key reasons we see enterprises shifting Pivotal Greenplum to AWS.
In this post, we focus on leveraging Pivotal Greenplum (parallel Postgres) for enterprise-scale analytics. We present discussions around deployment, updates, security, and speed.
Customers use Greenplum because it’s fast. You can run your query and the results will come back moments later. Greenplum on AWS is optimized for performance and can be even faster than a comparably configured on-premises deployment.
To achieve the best performance, we tuned both Greenplum and AWS resources in the following ways:
- All Pivotal Greenplum nodes are placed into Auto Scaling Groups to boost resiliency.
- All nodes feature 10 Gbps or faster networking for maximum performance.
- Data volumes use Throughput Optimized HDD (st1) EBS volumes.
The Greenplum CloudFormation template uses AWS placement groups to minimize latency by placing nodes in close physical proximity.
We gauge performance on AWS using the same open source utilities that are used for on-premises deployments: gpcheckperf and the TPC-DS benchmark. We also factor in the documented AWS specs for each virtual machine (VM) and disk type.
In particular, the TPC-DS benchmark is extremely useful for comparing performance across deployments using real-world database loading and query activities.
TPC-DS Benchmark Score
The Transaction Processing Performance Council (TPC) has created many benchmarks for different database workloads. The most commonly used benchmark for big data, data warehousing, and analytics is the Decision Support (DS) benchmark, or TPC-DS.
This benchmark consists of a star schema with 24 tables and 99 queries. Common benchmarking parameters are 3 TB of data and query execution with both one and five concurrent users.
The TPC-DS benchmark also includes more traditional DS activities like update statements. However, for Pivotal Greenplum, these activities are omitted from the quoted scores you can see in Figure 1, as they do not apply.
Simply put, the higher the score, the faster the cluster. Thanks to Pivotal Greenplum’s MPP architecture, more hardware will produce better results. Therefore, it’s most useful to compare the score in relation to the number of segment cores in the cluster.
Figure 1 – TPC-DS benchmark scores as a function of instance size.
As you can see, Pivotal Greenplum on AWS can achieve better results than a comparably deployed on-premises appliance solution. Of course, there’s a price-performance balance you’ll need to strike, and your Pivotal account team can help you with that.
Note that the AWS vCPUs quoted above are hyper-threaded, so two vCPUs equate to a single core.
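Because scores scale with hardware, the useful comparison is score per physical segment core. The arithmetic can be sketched as follows; the scores and instance sizes below are hypothetical examples, not published TPC-DS results:

```python
# Normalize a cluster's TPC-DS score by physical segment cores so that
# differently sized clusters can be compared. Two hyper-threaded vCPUs
# count as one physical core, per the note above.

def score_per_core(tpcds_score, segment_nodes, vcpus_per_node):
    physical_cores = segment_nodes * vcpus_per_node // 2  # 2 vCPUs ~= 1 core
    return tpcds_score / physical_cores

# Hypothetical example: an 8-node cluster of 32-vCPU instances
relative_score = score_per_core(120_000, segment_nodes=8, vcpus_per_node=32)
print(relative_score)  # 937.5 points per physical core
```

A larger cluster with a lower per-core score may still finish queries faster in absolute terms; this metric is for comparing efficiency, not raw speed.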
The best way to deploy Pivotal Greenplum is via AWS Marketplace. Follow the documentation, and your deployment will complete quickly, in less than an hour.
Figure 2 – A typical Greenplum deployment on AWS.
Pivotal Greenplum nodes are deployed on AWS using Auto Scaling Groups. Each group automatically provisions the number of nodes specified, and if a node fails for any reason, the Auto Scaling Group terminates the failed node and replaces it with a new one.
Figure 3 – Failed Greenplum nodes replaced by Auto Scaling Group.
For data availability, Pivotal Greenplum uses mirroring, a concept similar to HDFS replication (though Greenplum keeps two copies of the data rather than three). When a node fails, the Master node “promotes” the Mirror Segment to act as a Primary. After the new node comes online, the self-healing mechanism goes to work. It executes the commands needed to restore the system to its fully-functional state.
To ensure that user queries operate as normal during Segment recovery, the pgBouncer connection pooler pauses new queries before Segments are rebalanced; queries simply wait in the queue until recovery completes.
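The failover sequence can be illustrated with a schematic sketch. This is not Greenplum source code; the host names and dictionary fields are invented to show the promote/recover/rebalance cycle described above:

```python
# Schematic sketch of mirror failover: when a primary segment's host fails,
# the mirror is promoted to primary; once a replacement host joins, the
# segment data is rebuilt there and roles are rebalanced to the original
# layout. Field names and hosts are illustrative only.

def fail_primary(pair):
    """Mirror takes over when the primary host is lost."""
    pair["primary_host"], pair["mirror_host"] = pair["mirror_host"], None
    pair["balanced"] = False
    return pair

def recover(pair, replacement_host):
    """Self-healing rebuilds the lost segment on the replacement host."""
    pair["mirror_host"] = replacement_host
    return pair

def rebalance(pair):
    """Swap roles back so the original primary/mirror layout is restored."""
    pair["primary_host"], pair["mirror_host"] = (
        pair["mirror_host"], pair["primary_host"])
    pair["balanced"] = True
    return pair

pair = {"primary_host": "sdw1", "mirror_host": "sdw2", "balanced": True}
pair = rebalance(recover(fail_primary(pair), "sdw1-new"))
```

After the full cycle, the primary runs on the replacement host and the mirror is back on its original host, with no data loss.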
Single Master Node Replacement
In an on-premises deployment of Pivotal Greenplum, a Standby Master node is recommended. This node is mostly idle; it’s there in case the Master node fails, ensuring continuity if and when the Master node is replaced.
Thanks to self-healing on AWS, the Standby Master process has been moved to the first Segment node as part of the automated AWS install process. Scripts within the Amazon Machine Image (AMI) assign roles to the nodes in the Auto Scaling Group. If the Master node fails, the Standby Master is temporarily promoted to Master and then, once the replacement Master node is online, demoted back to Standby Master. This is all done automatically.
Figure 4 – The MDW distributes queries via the network interconnect to Segment nodes.
The Greenplum Database master (MDW) is the entry to the Greenplum Database system, accepting client connections and SQL queries, and distributing work to the segment instances (SDWn). When a user connects to the database via the Greenplum master and issues a query, processes are created in each segment database to handle the work of that query.
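The parallelism comes from data placement: rows are spread across segments by hashing a distribution key, so each segment scans only its own slice. A simplified illustration of the principle (Greenplum's actual hash function and segment mapping differ):

```python
# Simplified illustration of MPP data distribution: each row's distribution
# key is hashed to pick the segment that stores and processes it. We use
# CRC32 here for determinism; Greenplum's real hashing is different.

import zlib

NUM_SEGMENTS = 4  # illustrative cluster size

def segment_for(distribution_key):
    """Map a row's distribution key to one of the segments."""
    return zlib.crc32(str(distribution_key).encode()) % NUM_SEGMENTS

# Rows with the same key always land on the same segment, which is why
# joins on the distribution key can run segment-locally without data motion.
rows = ["cust-1", "cust-2", "cust-3", "cust-1"]
placements = [segment_for(r) for r in rows]
```

This is also why choosing a high-cardinality distribution key matters: a skewed key concentrates work on a few segments and undercuts the parallelism.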
By carefully matching AWS instance types and storage usage, customers can optimize their AWS consumption and Pivotal Greenplum license spend while preserving or increasing performance.
Amazon Elastic Block Store (Amazon EBS) volumes have a snapshot feature that is useful in backing up an EBS volume to Amazon Simple Storage Service (Amazon S3). EBS snapshots are stored in Amazon S3, but not in a user-visible bucket.
Pivotal Greenplum on AWS includes the gpsnap utility. This automates the execution of EBS snapshots in parallel for your entire cluster.
Figure 5 – Making a gpsnap backup for a future possible restore.
Each disk gets a snapshot and is tagged so that gpsnap can be used to restore the snapshots to the correct nodes and mounts.
A backup can be created with gpsnap on AWS extremely quickly—typical execution times are around one minute. Snapshot performance is completely dependent on AWS, and Greenplum waits until all of the disk snapshots are in the “pending” or “completed” status before a database restart process kicks off.
The snapshots then have to complete, and that performance depends on how full the disks are and whether there are prior snapshots. The gpcronsnap utility automates the scheduled execution of backups and is pre-configured to execute weekly.
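The per-disk tagging that makes restores possible can be sketched as a small helper. The tag keys below are invented for illustration; the point is that each snapshot must carry enough metadata (backup ID, host, mount) to map it back to the right node and mount:

```python
# Hypothetical sketch of gpsnap-style snapshot tagging: each EBS snapshot
# is tagged so a later restore can attach it to the correct node and mount
# point. Tag key names here are illustrative, not gpsnap's actual keys.

def snapshot_tags(backup_id, hostname, mount_point):
    """Build the tag set recorded on one disk's snapshot."""
    return {
        "greenplum:backup-id": backup_id,
        "greenplum:host": hostname,
        "greenplum:mount": mount_point,
    }

tags = snapshot_tags("backup-2019-06-01", "sdw1", "/data1")
```

At restore time, filtering snapshots by the backup ID and grouping by host and mount reconstructs the exact disk layout of the original cluster.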
A great advantage of deploying Pivotal Greenplum on AWS is taking advantage of EBS snapshots for disaster recovery (DR).
Figure 6 – With Greenplum, gpsnap data can be copied across AWS regions.
The aforementioned gpsnap utility can copy a snapshot from one region to another. You can then restore it to a new cluster when needed in a different region.
This is an on-demand, cost-effective DR solution. You don’t need to add the cost and complexity of a second cluster.
Upgrading Pivotal Greenplum
Another cloud-only utility for Pivotal Greenplum is gprelease, which automates the upgrade of Pivotal Greenplum on AWS. It also upgrades optional packages, like MADlib, Command Center, and PostGIS.
The gpcronrelease utility runs weekly and will notify you when a new version is available. Even the cloud tools such as gpsnap and gprelease are upgraded with gprelease.
Customers will enjoy peak performance for Pivotal Greenplum by following a few proven best practices, like analyzing, vacuuming, and reindexing.
All of these practices are combined in the gpmaintain utility, which automates many of the administrative tasks needed in a production database. The gpcronmaintain utility automates scheduled maintenance and can be easily configured to run more or less frequently.
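The kind of routine work gpmaintain automates can be sketched as generating the maintenance SQL for each table. The real utility does considerably more; the table list and statements below are illustrative:

```python
# Minimal sketch of routine Greenplum/Postgres maintenance of the sort
# gpmaintain automates: analyze and vacuum each table to keep planner
# statistics fresh and reclaim space, then rebuild its indexes.

def maintenance_statements(tables):
    """Generate the maintenance SQL to run against each table."""
    statements = []
    for table in tables:
        statements.append(f"VACUUM ANALYZE {table};")
        statements.append(f"REINDEX TABLE {table};")
    return statements

for stmt in maintenance_statements(["sales", "customers"]):
    print(stmt)
```

In practice these statements would be executed through psql during a scheduled maintenance window, which is exactly what gpcronmaintain's weekly schedule provides.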
During the initial deployment of Pivotal Greenplum on AWS, many optional components are available. In Figure 7 below, you can see a few components that may interest data scientists and administrators, such as:
- Greenplum Database provides a collection of data science-related Python modules that can be used with the Greenplum Database using PL/Python or PL/R languages.
- Greenplum Command Center (GPCC) is a web-based application for monitoring and managing Greenplum clusters. GPCC works with data collected by agents running on the segment hosts and saved to the gpperfmon database.
- MADlib is an open-source library for scalable in-database analytics. With the MADlib extension, you can use MADlib functionality in a Greenplum database.
- PostGIS is a spatial database extension that allows GIS objects to be stored in a Greenplum database.
Figure 7 – Greenplum install window.
After the deployment has completed, you can use the gpoptional utility to install or reinstall any of these components and further customize the deployment.
Pivotal Greenplum on AWS also includes phpPgAdmin, a web-based SQL tool. Business users, developers, and administrators use phpPgAdmin to perform ad hoc queries and browse schemas. It’s a handy utility for many common scenarios.
Pivotal has optimized phpPgAdmin for Pivotal Greenplum and created a Pivotal user interface theme. A self-signed SSL certificate is created during the deployment, so that traffic from your browser to the cluster is encrypted.
Figure 8 – Self-signed or commercial SSL certificates encrypt client-Greenplum connections.
In Figure 8 above, you see a Pivotal Greenplum SSL connection encrypting a query using a self-signed certificate.
Security in Review
Security is paramount, so Pivotal has worked with AWS to incorporate a number of best practices. These capabilities are designed to reduce your risk and ensure compliance with common enterprise requirements.
We protect your credentials, too. SSH password authentication is disabled in favor of SSH keys, database passwords are stored as MD5 hashes rather than plain text, and root and password file logins are disabled.
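The MD5 password mechanism mentioned above is the standard PostgreSQL format, which Greenplum inherits: the stored value is the literal prefix "md5" followed by the MD5 digest of the password concatenated with the username. A minimal sketch:

```python
# How PostgreSQL-family databases (including Greenplum) store MD5-based
# passwords: "md5" + md5(password || username). Shown for illustration;
# note that MD5 here is a one-way hash, not encryption.

import hashlib

def pg_md5_password(username, password):
    """Return the stored form of an MD5-authenticated database password."""
    digest = hashlib.md5((password + username).encode()).hexdigest()
    return "md5" + digest

stored = pg_md5_password("gpadmin", "secret")
```

Salting with the username means two users with the same password still get different stored hashes, though the scheme is far weaker than modern alternatives like SCRAM.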
Want data encryption at rest? It’s available via Amazon EBS encryption. An added bonus, your snapshots are automatically encrypted if the source EBS volume is encrypted.
Lastly, all Greenplum deployments are created in a dedicated Amazon Virtual Private Cloud (VPC) to ensure network isolation and easier management of security rules.
In this post, we provided a stepwise discussion on why running Pivotal Greenplum on AWS is a compelling option for enterprise-scale analytics.
You can leverage AWS for Greenplum to simplify deployment compared with a traditional on-premises solution. Performance of the Greenplum database is comparable to, or greater than, an on-premises deployment when you right-size the selected instance types during the highly automated CloudFormation execution. TPC-DS benchmark data helps align performance with instance pricing.
The AWS-deployed environment scales and “self heals” using Auto Scaling Groups, while day-to-day backups and disaster recovery (even across AWS regions) are possible by leveraging Amazon EBS snapshots combined with the Greenplum gpsnap tool.
Upgrading Greenplum is simplified using the cloud-only gprelease tool, while the core data science and other in-database analytics components may be readily (re)installed with the gpoptional utility and kept healthy with gpmaintain.
Furthermore, optional installs provide a highly customized, customer-centric data science environment. The phpPgAdmin tool provides easy access to Greenplum databases to run queries and perform schema analysis over SSL, if needed.
Pivotal works closely with AWS to deploy and maintain a secure operating environment, and AWS Marketplace makes it simple for even small business groups to deploy Pivotal Greenplum on AWS.
You can learn more about Pivotal Greenplum in the eBook Data Warehousing with Greenplum, Second Edition.
AWS Competency Partners: The Next Smart
Pivotal is an AWS Competency Partner, and if you want to be successful in today’s complex IT environment and remain that way tomorrow and into the future, teaming up with an AWS Competency Partner is The Next Smart.
Pivotal – APN Partner Spotlight
Pivotal is an AWS Competency Partner. They help the world’s largest companies transform the way they build and run software. Pivotal Greenplum is a commercial fully-featured MPP data warehouse platform powered by the open source Greenplum Database.
*Already worked with Pivotal? Rate this Partner
*To review an APN Partner, you must be an AWS customer that has worked with them directly on a project.
from AWS Partner Network (APN) Blog