This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch.
For HPC workloads, visibility into job logs is important, both to debug a failed job and to gain insight into a running job. Tracking a job's trajectory lets you adjust the configuration of the next job, or terminate the current one because it went off track.
With AWS Batch, customers can run batch workloads at scale, reliably and with ease, because this managed service takes care of the undifferentiated heavy lifting. Customers can then focus on submitting jobs and getting work done. Customers told us that at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs, as they all ended up in the same log group in Amazon CloudWatch.
With the new release of custom logging driver support, customers can now adjust how job output is logged. They can not only customize the Amazon CloudWatch settings, but also enable the use of external logging frameworks.
This allows customers to keep using the logging systems they are accustomed to for their AWS Batch jobs, with fine-grained control of the log data for debugging and access control purposes.
In this blog post, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example customizes the Amazon CloudWatch log group; the second logs to Splunk, an external logging service.
To showcase this new feature, I use the AWS Command Line Interface (CLI) to set up the following:
- IAM roles, policies, and profiles to grant access and permissions
- A compute environment to provide the compute resources to run jobs
- A job queue, which supervises the job execution and schedules jobs on a compute environment
- A job definition, which uses a simple job to demonstrate how the new configuration can be applied
Once those tasks are completed, I submit a job and send its logs to a customized CloudWatch log group and to Splunk.
To make things easier, I first set a couple of environment variables so that the information is handy for later use.
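Such a setup could look like the following sketch. The variable names and values are hypothetical placeholders, not taken from the original setup; replace the subnet and security group IDs with ones from your own VPC.

```shell
# Hypothetical names and placeholder values for this walkthrough.
export BATCH_CE_NAME="custom-logging-ce"               # compute environment name
export BATCH_QUEUE_NAME="custom-logging-queue"         # job queue name
export BATCH_JOB_DEF_NAME="custom-logging-env"         # job definition name
export BATCH_LOG_GROUP="/aws/batch/custom/env-queue"   # custom CloudWatch log group
export BATCH_SUBNET_ID="subnet-REPLACE_ME"             # subnet for the instances
export BATCH_SECURITY_GROUP_ID="sg-REPLACE_ME"         # security group for the instances

# Print the values so they can be double-checked before continuing.
env | grep '^BATCH_' | sort
```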
When using the AWS Management Console, you must create IAM roles manually.
IAM roles are defined for use by a particular service. In the simplest case, you want a role to be used by Amazon EC2, the service that provides compute capacity in the cloud. The definition of which entity is able to assume an IAM role is called a trust policy. To set up a trust policy for an IAM role, use the following code snippet.
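A minimal sketch of a trust policy that lets Amazon EC2 assume a role is shown here. The file name `ec2-trust-policy.json` is an arbitrary choice for this example.

```shell
# Write a trust policy document that allows the EC2 service to assume the role.
# The file name is an assumption made for this sketch.
cat > ec2-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
```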
With the IAM trust policy in place, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.
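As a sketch, the role creation could be wrapped in a small shell function. It assumes that a trust policy for `ec2.amazonaws.com` is stored in a file (the name `ec2-trust-policy.json` is hypothetical), and it also creates an instance profile of the same name, because EC2 instances pick up a role through an instance profile.

```shell
# Assumption: this file holds a trust policy for ec2.amazonaws.com.
TRUST_POLICY_FILE="${TRUST_POLICY_FILE:-ec2-trust-policy.json}"

create_ecs_instance_role() {
  # Create the role that the EC2 instances assume.
  aws iam create-role \
    --role-name ecsInstanceRole \
    --assume-role-policy-document "file://${TRUST_POLICY_FILE}" || return 1

  # Attach the AWS managed policy that allows interaction with Amazon ECS.
  aws iam attach-role-policy \
    --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role || return 1

  # Create an instance profile and add the role to it.
  aws iam create-instance-profile \
    --instance-profile-name ecsInstanceRole || return 1
  aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceRole \
    --role-name ecsInstanceRole
}
```

Call `create_ecs_instance_role` once your AWS CLI credentials are configured.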
The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role. You can set up this role with the following logic.
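A sketch of this step follows. It writes a trust policy for `batch.amazonaws.com` and wraps the role creation in a function. The file name is an assumption; the attached policy is the AWS managed `AWSBatchServiceRole` policy.

```shell
# Trust policy that allows the AWS Batch service to assume the role.
# The file name is an assumption made for this sketch.
cat > batch-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "batch.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

create_batch_service_role() {
  # Create the service role with the trust policy written above.
  aws iam create-role \
    --role-name AWSBatchServiceRole \
    --assume-role-policy-document file://batch-trust-policy.json || return 1

  # Attach the AWS managed policy for the AWS Batch service.
  aws iam attach-role-policy \
    --role-name AWSBatchServiceRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
}
```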
In addition to dealing with Amazon ECS, the instance role can create and write to Amazon CloudWatch log groups. To control which log group names are used, a condition is attached.
While the compute environment is coming up, let us create and attach a policy that allows a new log group to be created.
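One way to express such a restriction is sketched here. Note the swap: rather than a condition key, this example limits the log group names through the resource ARN pattern of the statement. The `/aws/batch/custom/` prefix, the policy name, and the file name are all assumptions.

```shell
# Inline policy that allows creating and writing to log groups whose names
# start with /aws/batch/custom/ (an assumed prefix). The restriction is
# expressed through the Resource ARN pattern rather than a Condition block.
cat > custom-log-group-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/batch/custom/*"
    }
  ]
}
EOF

attach_log_group_policy() {
  # Attach the document as an inline policy to the instance role.
  aws iam put-role-policy \
    --role-name ecsInstanceRole \
    --policy-name custom-log-group-access \
    --policy-document file://custom-log-group-policy.json
}
```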
At this point, I have created the IAM roles and policies so that the instance and the service are able to interact with the AWS APIs, including trust policies that define which services are meant to use them: Amazon EC2 for the ecsInstanceRole, and the AWS Batch service itself for the AWSBatchServiceRole.
Now, I am going to create a compute environment, which spins up an instance (with a target of one vCPU) to run the example job in.
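A sketch of the compute environment creation could look like this. The environment name, the vCPU limits, and the subnet and security group placeholders are assumptions; the role names match the ones described earlier.

```shell
# Hypothetical defaults; override them through the environment.
BATCH_CE_NAME="${BATCH_CE_NAME:-custom-logging-ce}"
BATCH_SUBNET_ID="${BATCH_SUBNET_ID:-subnet-REPLACE_ME}"
BATCH_SECURITY_GROUP_ID="${BATCH_SECURITY_GROUP_ID:-sg-REPLACE_ME}"

# Compute resources: EC2 instances with a target of one vCPU.
cat > compute-resources.json <<EOF
{
  "type": "EC2",
  "minvCpus": 0,
  "maxvCpus": 4,
  "desiredvCpus": 1,
  "instanceTypes": ["optimal"],
  "subnets": ["${BATCH_SUBNET_ID}"],
  "securityGroupIds": ["${BATCH_SECURITY_GROUP_ID}"],
  "instanceRole": "ecsInstanceRole"
}
EOF

create_compute_environment() {
  # Create a managed, enabled compute environment using the roles from above.
  aws batch create-compute-environment \
    --compute-environment-name "${BATCH_CE_NAME}" \
    --type MANAGED \
    --state ENABLED \
    --service-role AWSBatchServiceRole \
    --compute-resources file://compute-resources.json
}
```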
Once this step is complete, the compute environment is spun up in the background. This takes a moment. You can use the following command to check on the status of the compute environment.
aws batch describe-compute-environments
Once its state is ENABLED and its status is VALID, we can continue by setting up the job queue.
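If you prefer not to re-run the command by hand, a small polling helper can wait for the status. This is a sketch; the environment name, the number of attempts, and the polling interval are assumptions.

```shell
# Hypothetical default; override it through the environment.
BATCH_CE_NAME="${BATCH_CE_NAME:-custom-logging-ce}"

wait_for_compute_environment() {
  attempts=0
  # Poll up to 30 times, 10 seconds apart.
  while [ "${attempts}" -lt 30 ]; do
    # Ask only for the status field of the named compute environment.
    ce_status=$(aws batch describe-compute-environments \
      --compute-environments "${BATCH_CE_NAME}" \
      --query 'computeEnvironments[0].status' \
      --output text) || return 1
    echo "compute environment ${BATCH_CE_NAME}: ${ce_status}"
    case "${ce_status}" in
      VALID) return 0 ;;     # ready to accept a job queue
      INVALID) return 1 ;;   # creation failed; inspect the environment
    esac
    attempts=$((attempts + 1))
    sleep 10
  done
  return 1
}
```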
Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.
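A sketch of the job queue creation, assuming hypothetical names for the queue and the compute environment:

```shell
# Hypothetical defaults; override them through the environment.
BATCH_CE_NAME="${BATCH_CE_NAME:-custom-logging-ce}"
BATCH_QUEUE_NAME="${BATCH_QUEUE_NAME:-custom-logging-queue}"

create_job_queue() {
  # Create an enabled queue that schedules jobs on the compute environment.
  aws batch create-job-queue \
    --job-queue-name "${BATCH_QUEUE_NAME}" \
    --state ENABLED \
    --priority 1 \
    --compute-environment-order "order=1,computeEnvironment=${BATCH_CE_NAME}"
}
```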
The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.
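A job definition along these lines could be registered as sketched here. The image, resource sizes, names, and log group are assumptions. The `awslogs-group` and `awslogs-create-group` keys are options of the Docker awslogs logging driver.

```shell
# Hypothetical defaults; override them through the environment.
BATCH_JOB_DEF_NAME="${BATCH_JOB_DEF_NAME:-custom-logging-env}"
BATCH_LOG_GROUP="${BATCH_LOG_GROUP:-/aws/batch/custom/env-queue}"

# Container properties: a plain container that prints its environment
# and logs to a custom CloudWatch log group.
cat > awslogs-container-properties.json <<EOF
{
  "image": "alpine:latest",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "${BATCH_LOG_GROUP}",
      "awslogs-create-group": "true"
    }
  }
}
EOF

register_awslogs_job_definition() {
  # Register the job definition with the custom log configuration.
  aws batch register-job-definition \
    --job-definition-name "${BATCH_JOB_DEF_NAME}" \
    --type container \
    --container-properties file://awslogs-container-properties.json
}
```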
Using the above job definition, you can now submit a job.
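A sketch of the job submission, assuming hypothetical names for the job, the queue, and the job definition:

```shell
# Hypothetical defaults; override them through the environment.
BATCH_QUEUE_NAME="${BATCH_QUEUE_NAME:-custom-logging-queue}"
BATCH_JOB_DEF_NAME="${BATCH_JOB_DEF_NAME:-custom-logging-env}"

submit_logging_job() {
  # Submit the job and print only the resulting job ID.
  aws batch submit-job \
    --job-name custom-logging-test \
    --job-queue "${BATCH_QUEUE_NAME}" \
    --job-definition "${BATCH_JOB_DEF_NAME}" \
    --query jobId \
    --output text
}
```

The printed job ID can then be passed to `aws batch describe-jobs --jobs <job-id>` to follow the job status.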
Now, you can check the ‘Log Group’ in CloudWatch. Go to the CloudWatch console and find the ‘Log Group’ section on the left.
Now, click the log group defined above, and you should see the output of the job. This allows you to debug the job if something within the container went wrong, or to process the logs and create alarms and reports.
Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.
To send logs to Splunk, create an additional job definition with the Splunk token and URL. Please adjust the splunk-token to match your Splunk setup.
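A sketch of such a job definition follows. The URL, token, image, and names are placeholders and assumptions. The `splunk-url` and `splunk-token` keys are options of the Docker splunk logging driver; the values come from your HTTP Event Collector configuration.

```shell
# Placeholder values; replace them with your HTTP Event Collector settings.
SPLUNK_URL="${SPLUNK_URL:-https://splunk.example.com:8088}"
SPLUNK_TOKEN="${SPLUNK_TOKEN:-00000000-0000-0000-0000-000000000000}"

# Container properties: the same plain container, logging to Splunk.
cat > splunk-container-properties.json <<EOF
{
  "image": "alpine:latest",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "logConfiguration": {
    "logDriver": "splunk",
    "options": {
      "splunk-url": "${SPLUNK_URL}",
      "splunk-token": "${SPLUNK_TOKEN}"
    }
  }
}
EOF

register_splunk_job_definition() {
  # Register a second job definition that uses the splunk logging driver.
  aws batch register-job-definition \
    --job-definition-name custom-logging-splunk \
    --type container \
    --container-properties file://splunk-container-properties.json
}
```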
This forwards the logs to Splunk, as you can see in the following image.
This blog post showed you how to apply custom logging to AWS Batch using the awslogs and Splunk logging drivers. While these are two important logging drivers, please head over to the documentation to learn about json-file and other drivers, and to find the one that best matches your current logging infrastructure.