Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy to deploy, secure, scale, and monitor your Elasticsearch cluster in the AWS Cloud. Elasticsearch is a distributed database solution, which can be difficult to plan for and execute. This post discusses some best practices for deploying Amazon ES domains.

The most important practice is to iterate. If you follow these best practices, you can plan for a baseline Amazon ES deployment. Elasticsearch behaves differently for every workload—its latency and throughput are largely determined by the request mix, the requests themselves, and the data or queries that you run. There is no deterministic rule that can 100% predict how your workload will behave. Plan for time to tune and refine your deployment, monitor your domain’s behavior, and adjust accordingly.

Deploying Amazon ES

Whether you deploy on the AWS Management Console, in AWS CloudFormation, or via Amazon ES APIs, you have a wealth of options to configure your domain’s hardware, high availability, and security features. This post covers best practices for choosing your data nodes and your dedicated master nodes configuration.

When you configure your Amazon ES domain, you choose the instance type and count for data and the dedicated master nodes. Elasticsearch is a distributed database that runs on a cluster of instances or nodes. These node types have different functions and require different sizing. Data nodes store the data in your indexes and process indexing and query requests. Dedicated master nodes don’t process these requests; they maintain the cluster state and orchestrate. This post focuses on instance types. For more information about instance sizing for data nodes, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain. For more information about instance sizing for dedicated master nodes, see Get Started with Amazon Elasticsearch Service: Use Dedicated Master Instances to Improve Cluster Stability.

Amazon ES supports five instance classes: M, R, I, C, and T. As a best practice, use the latest generation instance type from each instance class. As of this writing, these are the M5, R5, I3, C5, and T2.

Choosing your instance type for data nodes

When choosing an instance type for your data nodes, bear in mind that these nodes carry all the data in your indexes (storage) and do all the processing for your requests (CPU). As a best practice, for heavy production workloads, choose the R5 or I3 instance type. If your emphasis is primarily on performance, the R5 typically delivers the best performance for log analytics workloads, and often for search workloads. The I3 instances are strong contenders and may suit your workload better, so you should test both. If your emphasis is on cost, the I3 instances have better cost efficiency at scale, especially if you choose to purchase reserved instances.

For an entry-level instance or a smaller workload, choose the M5s. The C5s are a specialized instance, relevant for heavy query use cases, which require more CPU work than disk or network. Use the T2 instances for development or QA workloads, but not for production. For more information about how many instances to choose, and a deeper analysis of the data handling footprint, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

Choosing your instance type for dedicated master nodes

When choosing an instance type for your dedicated master nodes, keep in mind that these nodes are primarily CPU-bound, with some RAM and network demand as well. The C5 instances work best as dedicated masters up to about 75 data node clusters. Above that node count, you should choose R5.

Choosing Availability Zones

Amazon ES makes it easy to increase the availability of your cluster by using the Zone Awareness feature. You can choose to deploy your data and master nodes in one, two, or three Availability Zones. As a best practice, choose three Availability Zones for your production deployments.

When you choose more than one Availability Zone, Amazon ES deploys data nodes equally across the zones and makes sure that replicas go into different zones. Additionally, when you choose more than one Availability Zone, Amazon ES always deploys dedicated master nodes in three zones (if the Region supports three zones). Deploying into more than one Availability Zone gives your domain more stability and increases your availability.

Elasticsearch index and shard design

When you use Amazon ES, you send data to indexes in your cluster. An index is like a table in a relational database. Each search document is like a row, and each JSON field is like a column.

Amazon ES partitions your data into shards, with a random hash by default. You must configure the shard count, and you should use the best practices in this section.

Index patterns

For log analytics use cases, you want to control the life cycle of data in your cluster. You can do this with a rolling index pattern. Each day, you create a new index, then archive and delete the oldest index in the cluster. You define a retention period that controls how many days (indexes) of data you keep in the domain based on your analysis needs. For more information, see Index State Management.

Setting your shard counts

There are two types of shards: primary and replica. The primary shard count defines how many partitions of data Elasticsearch creates. The replica count specifies how many additional copies of the primary shards it creates. You set the primary shard count at index creation and you can’t change it (there are ways, but it’s not recommended to use the _shrink or _split API for clusters under load at scale). You also set the replica count at index creation, but you can change the replica count on the fly and Elasticsearch adjusts accordingly by creating or removing replicas.

You can set the primary and replica shard counts if you create the index manually, with a POST command. A better way for log analytics is to set an index template. See the following code:

PUT _template/<template_name>
{ "index_patterns": ["logs-*"], "settings": { "number_of_shards": 10, "number_of_replicas": 1 }, "mappings": { … }
}

When you set a template like this, every index that matches the index_pattern has the settings and the mapping (if you specify one) applied to that index. This gives you a convenient way of managing your shard strategy for rolling indexes. If you change your template, you get your new shard count in the next indexing cycle.

You should set the number_of_shards based on your source data size, using the following guideline: primary shard count = (daily source data in bytes * 1.25) / 50 GB.

For search use cases, where you’re not using rolling indexes, use 30 GB as the divisor, targeting 30 GB shards. However, these are guidelines. Always test with your own data, indexing, and queries to find your optimal shard size.

You should try to align your shard and instance counts so that your shards distribute equally across your nodes. You do this by adjusting shard counts or data node counts so that they are evenly divisible. For example, the default settings for Elasticsearch versions 6 and below are 5 primary shards and 1 replica (a total of 10 shards). You can get even distribution by choosing 2, 5, or 10 data nodes. Although it’s important to distribute your workload evenly on your data nodes, it’s not always possible to get every index deployed equally. Use the shard size as the primary guide for shard count and make small (< 20%) adjustments, generally favoring more instances or smaller shards, based on even distribution.

Determining storage size

So far, you’ve mapped out a shard count, based on the storage needed. Now you need to make sure that you have sufficient storage and CPU resources to process your requests. First, find your overall storage need: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention.

You multiply your unreplicated index size by the number of replicas and days of retention to determine the total storage needed. Each replica adds an additional storage need equal to the primary storage size. You add this again for every day you want to retain data in the cluster. For search use cases, set the number of days of retention to 1.

The total storage need drives a minimum on the instance type and instance based on the maximum storage that instance provides. If you’re using EBS-backed instances like the M5 or R5, you can deploy EBS volumes up to the supported limit. For more information, see Amazon Elasticsearch Service Limits.

For instances with ephemeral store, storage is limited by the instance type (for example, I3.8xlarge.elasticsearch has 7.8 TB of attached storage). If you choose EBS, you should use the general purpose, GP2, volume type. Although the service does support the io1 volume type and provisioned IOPS, you generally don’t need them. Use provisioned IOPS only in special circumstances, when metrics support it.

Take the total storage needed and divide by the maximum storage per instance of your chosen instance type to get the minimum instance count.

After you have an instance type and count, make sure you have sufficient vCPUs to process your requests. Multiply the instance count by the vCPUs that instance provides. This gives you a total count of vCPUs in the cluster. As an initial scale point, make sure that your vCPU count is 1.5 times your active shard count. An active shard is any shard for an index that is receiving substantial writes. Use the primary shard count to determine active shards for indexes that are receiving substantial writes. For log analytics, only the current index is active. For search use cases, which are read heavy, use the primary shard count.

Although 1.5 is recommended, this is highly workload-dependent. Be sure to test and monitor CPU utilization and scale accordingly.

As you work with shard and instance counts, bear in mind that Amazon ES works best when the total shard count is as small as possible—fewer than 10,000 is a good soft limit. Each instance should also have no more than 25 shards total per GB of JVM heap on that instance. For example, the R5.xlarge has 32 GB of RAM total. The service allocates half the RAM (16 GB) for the heap (the maximum heap size for any instance is 31.5 GB). You should never have more than 400 = 16 * 25 shards on any node in that cluster.

Use case

Assume you have a log analytics workload supporting Apache web logs (500 GB/day) and syslogs (500 GB/day), retained for 7 days. This post focuses on the R5 instance type as the best choice for log analytics. You use a three-Availability Zone deployment, one primary and two replicas per index. With a three-zone deployment, you have to deploy nodes in multiples of three, which drives instance count and, to some extent, shard count.

The primary shard count for each index is (500 * 1.25) / 50 GB = 12.5 shards, which you round to 15. Using 15 primaries allows additional space to grow in each shard and is divisible by three (the number of Availability Zones, and therefore the number of instances, are a multiple of 3). The total storage needed is 1,000 * 1.25 * 3 * 7 = 26.25 TB. You can provide that storage with 18x R5.xlarge.elasticsearch, 9x R5.2xlarge.elasticsearch, or 6x R5.4xlarge.elasticsearch instances (based on EBS limits of 1.5 TB, 3 TB, and 6 TB, respectively). You should pick the 4xlarge instances, on the general guideline that vertical scaling is usually higher performance than horizontal scaling (there are many exceptions to this general rule, so make sure to iterate appropriately).

Having found a minimum deployment, you now need to validate the CPU count. Each index has 15 primary shards and 2 replicas, for a total of 45 shards. The most recent indexes receive substantial write, so each has 45 active shards, giving a total of 90 active shards. You ignore the other 6 days of indexes because they are infrequently accessed. For log analytics, you can assume that your read volume is always low and drops off as the data ages. Each R5.4xlarge.elasticsearch has 16 vCPUs, for a total of 96 in your cluster. The best practice guideline is 135 = 90 * 1.5 vCPUs needed. As a starting scale point, you need to increase to 9x R5.4xlarge.elasticsearch, with 144 vCPUs. Again, testing may reveal that you’re over-provisioned (which is likely), and you may be able to reduce to six. Finally, given your data node and shard counts, provision 3x C5.large.elasticsearch dedicated master nodes.

Conclusion

This post covered some of the core best practices for deploying your Amazon ES domain. These guidelines give you a reasonable estimate of the number and type of data nodes. Stay tuned for subsequent posts that cover best practices for deploying secure domains, monitoring your domain’s performance, and ingesting data into your domain.

 


About the Author

Jon Handler

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.