Since its launch in 2009, Amazon CloudWatch has become the cloud-native choice for a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health that includes common metrics. Amazon EventBridge complements CloudWatch and provides real-time access to changes in data in AWS services, your own applications, and software as a service (SaaS) applications without writing code. In addition to providing native integration with other AWS services like AWS Systems Manager, EventBridge also integrates with many third-party SaaS platforms, making it a powerful instrument in the observability tool chest for customers.

Customers want to know how to get deeper insights into operating system (OS)-level process and service-level metrics and combine them with EventBridge to trigger auto-remediation at scale through Systems Manager.

In this post, we’ll show how you can use EventBridge rules to run Systems Manager automation at scale across your Amazon Elastic Compute Cloud (Amazon EC2) fleet in response to CloudWatch alarms that monitor OS-level process state changes.

Workflow

Complete the steps in this post to create the following workflow:

  • An EC2 instance runs Apache HTTP Server and the CloudWatch agent, whose procstat plugin monitors the httpd process.
  • When the httpd process is stopped, CloudWatch raises an alarm and sends an event to EventBridge.
  • EventBridge receives the CloudWatch event that matches the event pattern in the predefined rule. EventBridge sends the event to the specified target (Systems Manager) and triggers the action defined in the rule.
  • The aws:executeAwsApi automation action calls the SendCommand API operation with the EC2 instance ID and the name of the SSM Command document, which Systems Manager delivers to SSM Agent running on the EC2 instance.
  • SSM Agent runs the Command document on the EC2 instance to restart the httpd process.

Figure 1 shows the architecture and flow of the proposed solution.

When the httpd state changes, the CloudWatch alarm is triggered and sends the event to EventBridge, which matches the pattern in the rule and triggers the action defined in the rule. That, in turn, executes the runbook to restart the process.

Figure 1: Solution architecture

Deployment steps

  • Set up an EC2 instance running Amazon Linux 2 with Apache HTTP Server and install and configure the procstat plugin.
  • Create an IAM role to execute the EventBridge rule.
  • Create a runbook whose aws:executeAwsApi action calls SendCommand with the EC2 instance ID and the Command document that SSM Agent runs to restart the httpd process.
  • Create a CloudWatch alarm to monitor the httpd process state change (for example, from running to stopped).
  • Integrate EventBridge with Systems Manager. Create an EventBridge rule to receive the event, trigger the action defined in the rule, and send the event to the target (Systems Manager).

Install and set up the SSM Agent, procstat plugin, and Apache HTTP Server on EC2

SSM Agent, which is required to use Systems Manager Automation, is installed by default on Amazon Linux 1 and 2 AMIs. On EC2 instances created from other Linux AMIs, you must install SSM Agent manually. For instructions, see Installing and configuring SSM Agent on EC2 instances for Linux in the Systems Manager User Guide.

The recommended way to install and configure the CloudWatch agent and procstat plugin is to use Systems Manager. For instructions, see the Detecting and remediating process issues on EC2 instances using Amazon CloudWatch and AWS Systems Manager blog post and Installing the CloudWatch agent on EC2 instances using your agent configuration in the CloudWatch User Guide. The process uses the AWS-ConfigureAWSPackage document with Systems Manager Run Command, so SSM Agent must already be installed.
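As a rough illustration of what the agent configuration might contain, the following Python sketch builds a minimal CloudWatch agent configuration with a procstat section for httpd and prints it as JSON. The function name, the chosen measurements, and the 60-second interval are assumptions made for this example, not values taken from the referenced guides; adapt them to your own setup.

```python
import json


def build_procstat_config(exe: str = "httpd", interval_seconds: int = 60) -> dict:
    """Return a minimal CloudWatch agent configuration that collects procstat metrics.

    The measurements and interval are illustrative assumptions.
    """
    return {
        "metrics": {
            # Add the instance ID as a dimension so an alarm can target one instance.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "procstat": [
                    {
                        # Monitor processes whose executable name matches this value.
                        "exe": exe,
                        "measurement": ["cpu_usage", "pid_count"],
                        "metrics_collection_interval": interval_seconds,
                    }
                ]
            },
        }
    }


config = build_procstat_config()
print(json.dumps(config, indent=2))
```

On Linux, the agent prefixes these measurements with procstat_ when it publishes them (for example, procstat_cpu_usage), by default in the CWAgent namespace.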

To install Apache HTTP Server on EC2, see Create an EC2 instance and install a web server in the Amazon RDS User Guide.

Create an IAM role to execute the EventBridge rule

To execute the automation, you must attach an AWS Identity and Access Management (IAM) role with the AmazonSSMFullAccess policy to the EC2 instance. The same role is used when you configure the EventBridge rule to run the SSM Automation document. The managed policy grants full access to the Systems Manager API and documents, which is more than this workflow requires. As a best practice, always grant least privilege (that is, grant only the permissions required to perform a task).

We recommend that you create the role using an AWS CloudFormation template. Edit the trust relationships for the AutomationServiceRole to include events.amazonaws.com. See Figure 2.

The Summary section for the AutomationServiceRole includes the role ARN, description, creation time, and more. The Trust relationships tab is selected and displays these trusted entities: ssm.amazonaws.com, events.amazonaws.com, ec2.amazonaws.com.

Figure 2: AutomationServiceRole to execute the EventBridge rule
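The trust relationship described for Figure 2 can be sketched as a policy document. The following Python snippet is a minimal illustration that builds a trust policy for the three trusted entities listed above and prints it as JSON. The helper name is hypothetical, and the single-statement layout is an assumption; your CloudFormation template may structure it differently.

```python
import json

# Trusted entities listed for the AutomationServiceRole in Figure 2.
TRUSTED_SERVICES = ["ssm.amazonaws.com", "events.amazonaws.com", "ec2.amazonaws.com"]


def build_trust_policy(services):
    """Return an IAM trust policy that lets the given services assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": sorted(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy = build_trust_policy(TRUSTED_SERVICES)
print(json.dumps(trust_policy, indent=2))
```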

Create an SSM runbook

An Automation document (now referred to as a runbook) defines the actions that Systems Manager performs on managed instances and other AWS resources when an automation runs. A runbook contains one or more steps that run in sequential order. Each step is built around a single action. Output from one step can be used as input in a later step.

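To create a runbook, see Creating a runbook using the Editor in the Systems Manager User Guide. Name the runbook Monitor_Process_Automation_Document, and set the DocumentName input of its aws:executeAwsApi step to Monitor_Process_SSM_Document, which is the Run Command document created in the following steps.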

Systems Manager Automation supports several actions. In this solution, the aws:executeAwsApi action calls the Systems Manager SendCommand API operation, which runs a Command document on the EC2 instance to restart the httpd process.

To create the Run Command document that will restart the httpd process, in the Systems Manager console, choose Shared Resources, and then choose Documents.

From Create document, choose Command or Session, as shown in Figure 3.

The Owned by Amazon tab is selected. Under Create document, Command or Session is highlighted.

Figure 3: Create the Run Command document

Complete the required fields as shown:

Document Name: Monitor_Process_SSM_Document
Target type: /AWS::EC2::Instance
Document type: Command Document
Content: JSON

In the Content field, enter the following JSON and then choose Create document.

{
  "schemaVersion": "1.2",
  "description": "restart httpd",
  "parameters": {},
  "runtimeConfig": {
    "aws:runShellScript": {
      "properties": [
        {
          "id": "0.aws:runShellScript",
          "runCommand": [
            "sudo systemctl start httpd",
            "echo Process restarted with status $?"
          ]
        }
      ]
    }
  }
}

The example runbook looks as follows:

{
  "description": "Custom Automation to send SSM command to an instance",
  "schemaVersion": "0.3",
  "assumeRole": "{{ AutomationAssumeRole }}",
  "parameters": {
    "AutomationAssumeRole": {
      "type": "String",
      "description": "(Required) The ARN of the role that allows Automation to perform\nthe actions on your behalf. If no role is specified, Systems Manager Automation\nuses your IAM permissions to run this runbook.",
      "default": ""
    },
    "InstanceId": {
      "type": "String",
      "description": "(Required) The ID of the EC2 instance.",
      "default": ""
    }
  },
  "mainSteps": [
    {
      "name": "createImage",
      "action": "aws:executeAwsApi",
      "onFailure": "Abort",
      "inputs": {
        "Service": "ssm",
        "Api": "send_command",
        "InstanceIds": [
          "{{ InstanceId }}"
        ],
        "DocumentName": "Monitor_Process_SSM_Document"
      },
      "outputs": [
        {
          "Name": "Command",
          "Selector": "$.Command.CommandId",
          "Type": "String"
        }
      ]
    }
  ]
}
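To make the aws:executeAwsApi step more concrete, here is a minimal Python sketch of the equivalent SendCommand call. It assumes a client object that exposes a boto3-style send_command method. The FakeSsmClient class and the instance ID are hypothetical stand-ins so the example runs without AWS credentials. This illustrates what the step does; it is not how Systems Manager implements it.

```python
from typing import Any

COMMAND_DOCUMENT = "Monitor_Process_SSM_Document"


def send_restart_command(ssm_client: Any, instance_id: str) -> str:
    """Send the Command document to one instance and return the command ID."""
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName=COMMAND_DOCUMENT,
    )
    # Same value the runbook extracts with the selector $.Command.CommandId.
    return response["Command"]["CommandId"]


class FakeSsmClient:
    """Hypothetical stand-in for a boto3 SSM client, used only for a dry run."""

    def __init__(self) -> None:
        self.calls = []

    def send_command(self, **kwargs):
        self.calls.append(kwargs)
        return {"Command": {"CommandId": "example-command-id"}}


fake = FakeSsmClient()
command_id = send_restart_command(fake, "i-0123456789abcdef0")
print(command_id)
```

With boto3 installed and credentials configured, you could pass boto3.client("ssm") instead of the fake client to send the command to a real instance.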

Create a CloudWatch alarm

From the left navigation pane of the CloudWatch console, choose Alarms, choose Create Alarm, and then choose Select Metric.

Choose your EC2 instance. For Namespace, use CWAgent. For Metric name, use procstat_cpu_usage.

Under Conditions, for Threshold type, choose Static. Complete the remaining fields as shown in Figure 4.

Under Whenever procstat_cpu_usage, the threshold is set to lower than 0.01. Under Datapoints to alarm, 2 out of 3 datapoints must be breaching to cause the alarm to go to ALARM state. Under Missing data treatment, Treat missing data as bad (breaching threshold) is selected.

Figure 4: Create a static CloudWatch alarm
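The alarm condition in Figure 4 (threshold lower than 0.01, 2 out of 3 datapoints, missing data treated as breaching) can be reasoned about with a small model. The following Python sketch is a deliberately simplified approximation of that evaluation, written for illustration only. CloudWatch's actual evaluation of missing data is more nuanced, and the sample values are made up.

```python
from typing import Optional, Sequence

THRESHOLD = 0.01          # "lower than 0.01" from Figure 4
DATAPOINTS_TO_ALARM = 2   # datapoints that must be breaching
EVALUATION_PERIODS = 3    # size of the evaluation window


def is_breaching(value: Optional[float]) -> bool:
    # Missing data (None) is treated as breaching, as configured in Figure 4.
    return value is None or value < THRESHOLD


def alarm_state(datapoints: Sequence[Optional[float]]) -> str:
    """Return "ALARM" or "OK" for the most recent evaluation window."""
    window = list(datapoints)[-EVALUATION_PERIODS:]
    breaching = sum(1 for value in window if is_breaching(value))
    return "ALARM" if breaching >= DATAPOINTS_TO_ALARM else "OK"


print(alarm_state([2.5, 3.1, 2.8]))    # three healthy datapoints
print(alarm_state([2.5, None, None]))  # two periods with no data
```

The first call prints OK and the second prints ALARM, which shows why treating missing data as breaching matters if the metric stops being reported for a stopped process.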

In Configure actions, under Alarm state trigger, choose In alarm. Under Select an SNS topic, choose to send the alarm notifications to an existing SNS topic. You can choose Create a new topic if you don’t already have one. Under Send a notification to, choose Notify_By_Email, as shown in Figure 5:

 

The fields in Configure actions are set as described in the post.

Figure 5: Configure CloudWatch alarm actions

For the alarm name, enter Monitor_Process_CW_Alert and then choose Create alarm.

Integrate EventBridge with Systems Manager

Now integrate EventBridge with Systems Manager to trigger the runbook to send the SSM document to the EC2 instance. For more information, including a sample event from CloudWatch, see Alarm events and EventBridge in the CloudWatch User Guide. For information about how to create a custom event pattern for a CloudWatch event rule, see this AWS Knowledge Center article. You get the event pattern shown here after you’ve created the CloudWatch alarm.

In the Amazon EventBridge console, choose Events, choose Rules, and then choose Create Rule. Create the rule with a custom pattern as shown in Figure 6.

In Event pattern, paste the following:

{
  "detail-type": ["CloudWatch Alarm State Change"],
  "source": ["aws.cloudwatch"],
  "detail": {
    "alarmName": ["Monitor_Process_CW_Alert"],
    "state": {
      "value": ["ALARM"]
    },
    "previousState": {
      "value": ["OK"]
    }
  }
}

The rule name is CW_Alarm_To_Trigger_SSM_Runbook. Under Define pattern, Custom pattern is selected. Event pattern displays the pasted code.

Figure 6: Creating CloudWatch alarm rule with a custom pattern
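To see which events the pattern accepts, the following Python sketch implements a simplified exact-value matcher and applies it to two trimmed sample events. The matcher only covers the exact-match and nested-object cases used in this pattern, not the full EventBridge pattern syntax. The helper names and sample events are hypothetical.

```python
from typing import Any

EVENT_PATTERN = {
    "detail-type": ["CloudWatch Alarm State Change"],
    "source": ["aws.cloudwatch"],
    "detail": {
        "alarmName": ["Monitor_Process_CW_Alert"],
        "state": {"value": ["ALARM"]},
        "previousState": {"value": ["OK"]},
    },
}


def matches(pattern: dict, event: Any) -> bool:
    """Simplified exact-value matching: every pattern key must match the event."""
    if not isinstance(event, dict):
        return False
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not matches(expected, actual):
                return False
        else:
            # Leaf: the pattern lists the accepted values for this field.
            values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in values):
                return False
    return True


def alarm_event(state: str, previous: str) -> dict:
    """Build a trimmed, hypothetical alarm state change event."""
    return {
        "detail-type": "CloudWatch Alarm State Change",
        "source": "aws.cloudwatch",
        "detail": {
            "alarmName": "Monitor_Process_CW_Alert",
            "state": {"value": state},
            "previousState": {"value": previous},
        },
    }


print(matches(EVENT_PATTERN, alarm_event("ALARM", "OK")))
print(matches(EVENT_PATTERN, alarm_event("OK", "ALARM")))
```

The first call prints True and the second prints False. Because the pattern requires the previous state to be OK, a transition from INSUFFICIENT_DATA to ALARM would not match this rule either.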

In Select targets, for Target, choose SSM Automation. For Document, choose Monitor_Process_Automation_Document.

Expand Configure automation parameter(s) and choose Input Transformer. In the first field, enter: {"instance": "$.detail.configuration.metrics[0].metricStat.metric.dimensions.InstanceId"}

In the second field, enter: {"InstanceId":[<instance>]}

Choose Use existing role and then choose AutomationServiceRole.

The fields for the Automation document are completed as described in the post.

Figure 7: Using input transformers with Automation
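The effect of the input transformer can be sketched in a few lines of Python. This example resolves the input path against a trimmed sample event and fills the template, on the assumption that placeholders are replaced with the JSON-encoded value. The path resolver handles only dotted keys and numeric indexes, and the sample event and instance ID are hypothetical.

```python
import json
import re
from typing import Any

INPUT_PATHS = {
    "instance": "$.detail.configuration.metrics[0].metricStat.metric.dimensions.InstanceId"
}
INPUT_TEMPLATE = '{"InstanceId":[<instance>]}'


def resolve_path(event: dict, path: str) -> Any:
    """Resolve a simple JSONPath made of dotted keys and [index] steps."""
    current: Any = event
    # Each token is either a key name after a dot or a numeric index in brackets.
    for key, index in re.findall(r"\.([A-Za-z0-9_-]+)|\[(\d+)\]", path):
        current = current[key] if key else current[int(index)]
    return current


def transform(event: dict) -> dict:
    """Fill the template placeholders with JSON-encoded values from the event."""
    rendered = INPUT_TEMPLATE
    for name, path in INPUT_PATHS.items():
        rendered = rendered.replace(f"<{name}>", json.dumps(resolve_path(event, path)))
    return json.loads(rendered)


sample_event = {
    "detail": {
        "configuration": {
            "metrics": [
                {
                    "metricStat": {
                        "metric": {
                            "dimensions": {"InstanceId": "i-0123456789abcdef0"}
                        }
                    }
                }
            ]
        }
    }
}

print(transform(sample_event))
```

The result is a dictionary with a single InstanceId key holding a one-element list, which is the shape the runbook's InstanceId parameter receives from the rule.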

Choose Create. Figure 8 shows the CW_Alarm_To_Trigger_SSM_Runbook rule.

The details page for the rule displays the values used in the procedure steps, including the event pattern and target values.

Figure 8: EventBridge rule

Test the SSM automation

Use SSH to connect to the EC2 instance. Run this command to stop Apache HTTP Server:

$ sudo systemctl stop httpd

Run this command to verify that the server has stopped:

$ sudo systemctl status httpd

Wait for the alarm to enter the ALARM state and the runbook to start the process again, and then check the status:

$ sudo systemctl status httpd

You’ll see that the httpd server process is running again because the automation was triggered by the EventBridge rule.
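If you prefer not to rerun the status command by hand, a small polling helper can wait for the service to come back. The following Python sketch is one possible approach, assuming it runs on the EC2 instance itself. The function names, the 300-second timeout, and the 5-second interval are arbitrary choices for illustration.

```python
import subprocess
import time
from typing import Callable


def httpd_is_active() -> bool:
    """Return True when `systemctl is-active httpd` reports an active unit."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", "httpd"], check=False
    )
    return result.returncode == 0


def wait_until(
    check: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or about `timeout` seconds have been slept."""
    waited = 0.0
    while True:
        if check():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

Calling wait_until(httpd_is_active) on the instance returns True once httpd is running again, or False if the timeout is reached first.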

Cleanup

To avoid ongoing charges in your account, delete the EC2 instance, CloudWatch alarm, and SSM document.

Conclusion

In this post, we showed how you can use EventBridge rules to run Systems Manager automation at scale in response to CloudWatch alarms. We hope you use the information in this post to add process-level metrics and automation in your organization.

About the authors

Rizwan Mushtaq

Rizwan is a Senior Solutions Architect at AWS. He helps customers design innovative, resilient, and cost-effective solutions using AWS services. He holds an MS in Electrical Engineering from Wichita State University.

Sahil Thapar

Sahil Thapar is an Enterprise Solutions Architect. He works with customers to help them build highly available, scalable, and resilient applications on the AWS Cloud. He is currently focused on containers and machine learning solutions.

Rakesh Ramadas

Rakesh Ramadas is an ISV Solution Architect at AWS. His focus areas include AI/ML and big data.