This is the second post in a two-part series about AWS Systems Manager Incident Manager. In the first post, we covered onboarding steps like creating contacts, an escalation plan, and a response plan in Incident Manager. In this post, we discuss the integration between Incident Manager and Amazon CloudWatch and how Incident Manager components manage an incident. We also show you how to do post-incident analysis.

Many customers use Amazon CloudWatch to monitor their AWS environments. Incident Manager integrates with Amazon CloudWatch so that an incident starts automatically when a CloudWatch alarm goes into the alarm state, that is, when the monitored metric crosses a threshold you define. Start by creating an alarm.

  1. From the left navigation pane of the Amazon CloudWatch console, choose Alarms, and then choose Create alarm.

The Alarms page in the CloudWatch console displays columns for alarm name, state, last state update, conditions, and actions. In this example, there are no alarms to display.

Figure 1: Alarms page

  2. You will be directed to a page to specify metrics and conditions. Choose Select metric to display a list of metrics you can monitor. Choose EC2, and then choose Per-Instance Metrics.

Select metric displays an untitled, empty graph. Under Metrics, there are 14 EC2 metrics, 7 Events, 16 Lambda, and 4 Logs.

Figure 2: Select metric

  3. Find the instance you created for part 1 of this series, choose the CPUUtilization metric, and then choose Select metric. For Period, enter a custom value of 120 seconds.

In the Metrics list, the CPUUtilization metric is selected.

Figure 3: CPUUtilization metric

  4. You can now specify the condition of this CloudWatch alarm. Under Conditions, define a threshold value of 85 and then choose Next.

In Conditions, under Threshold type, Static is selected. Under Whenever CPUUtilization is, Greater is selected. The threshold value is set to 85.

Figure 4: Conditions

  5. Now configure what happens when the alarm is triggered. Under Systems Manager action, choose Create incident. In the Response Plan search field, enter EC2HighCPU. Choose Next.

Under Systems Manager action, the Create SSM Incident option is selected. When the alarm is in the In Alarm state, the action will start an incident using the information in the response plan.

Figure 5: Systems Manager action

  6. Enter a name for the alarm, choose Next, and then choose Create alarm.

You have now finished setting up a CloudWatch alarm that will trigger an incident.
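
The console steps above can also be scripted. The following is a minimal sketch using the AWS CLI, not the method used in this walkthrough. It assumes the response plan from part 1 is named EC2HighCPU, and that the alarm uses the Average statistic with a single evaluation period (neither setting is stated in the steps above). The function name create_cpu_alarm and its arguments are hypothetical.

```shell
# Sketch: create the walkthrough's CPU alarm with the AWS CLI.
# create_cpu_alarm is a hypothetical helper name, not an AWS command.
create_cpu_alarm() {
  local instance_id="$1"              # EC2 instance from part 1
  local alarm_name="$2"               # any alarm name you choose
  local plan_name="${3:-EC2HighCPU}"  # response plan chosen under Systems Manager action
  local plan_arn

  # Look up the response plan ARN instead of constructing it by hand.
  plan_arn=$(aws ssm-incidents list-response-plans \
    --query "responsePlanSummaries[?name=='${plan_name}'].arn | [0]" \
    --output text) || return 1
  if [ -z "$plan_arn" ] || [ "$plan_arn" = "None" ]; then
    echo "response plan not found: ${plan_name}" >&2
    return 1
  fi

  # Period 120 and threshold 85 match the console steps.
  # The statistic and evaluation periods are assumptions.
  aws cloudwatch put-metric-alarm \
    --alarm-name "$alarm_name" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Average \
    --period 120 \
    --evaluation-periods 1 \
    --threshold 85 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$plan_arn"
}
```

For example, create_cpu_alarm i-0123456789abcdef0 ec2-high-cpu would create the alarm for a placeholder instance ID. Passing the response plan ARN as the alarm action is what tells CloudWatch to start an incident when the alarm fires.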

Incident management with Incident Manager

To simulate the incident of high CPU utilization, use the stress command. Connect to your EC2 instance and use the following commands to install stress:

sudo amazon-linux-extras install epel -y
sudo yum install stress -y

Now use the following command to impose some stress on your EC2 instance. It runs three CPU stressors for 300 seconds.

sudo stress --cpu 3 --timeout 300
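
The CPUUtilization metric reflects the whole instance, so three workers might not push it past 85% on an instance with more than three vCPUs (for example, three busy workers on four vCPUs is roughly 75%). As a hedged alternative, the sketch below starts one worker per available processing unit. The function name stress_all_cpus is hypothetical, and the sketch assumes nproc is available on the instance.

```shell
# Sketch: run one stress worker per processing unit reported by nproc.
# stress_all_cpus is a hypothetical helper name.
stress_all_cpus() {
  local seconds="${1:-300}"   # same 300-second run as the command above
  local workers
  workers=$(nproc) || return 1
  sudo stress --cpu "$workers" --timeout "$seconds"
}
```

Calling stress_all_cpus with no arguments keeps the 300-second duration used in this walkthrough.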

When CPU utilization exceeds 85%, the alarm goes into the alarm state and a new incident appears on the Incident Manager page of the Systems Manager console. At this point, the primary responder you added in the Contacts section receives a message about the incident. If you created a chat channel, it also receives an incident notification.

The cpu-incident-1 page shows the incident has an impact of Medium and a duration of 1 minute. It includes a CPUUtilization graph.

Figure 6: cpu-incident-1
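
If you prefer to follow along from a terminal, the sketch below polls the alarm until it reports ALARM and then lists open incidents. It is an illustration rather than part of the walkthrough. The helper names wait_for_alarm and list_open_incidents, and the default retry count and delay, are assumptions.

```shell
# Sketch: wait for the alarm to fire, then list open incidents.
# wait_for_alarm and list_open_incidents are hypothetical helper names.
wait_for_alarm() {
  local alarm_name="$1"
  local attempts="${2:-20}"   # assumed default: 20 checks
  local delay="${3:-30}"      # assumed default: 30 seconds between checks
  local state
  while [ "$attempts" -gt 0 ]; do
    state=$(aws cloudwatch describe-alarms \
      --alarm-names "$alarm_name" \
      --query 'MetricAlarms[0].StateValue' \
      --output text) || return 1
    echo "alarm state: ${state}"
    if [ "$state" = "ALARM" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1   # the alarm never reached ALARM within the allowed checks
}

list_open_incidents() {
  # Filter on the client side with a JMESPath query.
  aws ssm-incidents list-incident-records \
    --query "incidentRecordSummaries[?status=='OPEN'].[title,impact,arn]" \
    --output text
}
```

The impact column is numeric in the API output, where 3 corresponds to the Medium impact shown in Figure 6.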

When your responder gets engaged, they can view the metrics and runbook related to the incident. Because you created a manual runbook, you can choose Resume after you complete each step.

On the Runbook tab, ec2-cpu-util-runbook (v4) has a status of Waiting, which means the AttachToAutoScalingGroup step is not complete. The Inspect and AddMetrics steps have a status of Success.

Figure 7: ec2-cpu-util-runbook

After you have finished all the steps in the runbook and have confirmed that the incident has been mitigated, you can choose Resolve incident. The Timeline tab displays the events related to the incident.

The Timeline tab displays events related to cpu-incident-1, including Incident Started, CPUUtilization added to metrics, and Incident status updated to RESOLVED.

Figure 8: Timeline tab
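
The same timeline is available through the AWS CLI, and an incident can be resolved there as well. The following sketch assumes you already have the incident record ARN (for example, from a list of incident records). The helper names show_timeline and resolve_incident are hypothetical.

```shell
# Sketch: print an incident's timeline and resolve it from the AWS CLI.
# show_timeline and resolve_incident are hypothetical helper names.
show_timeline() {
  local incident_arn="$1"
  # One line per event: time, then event type.
  aws ssm-incidents list-timeline-events \
    --incident-record-arn "$incident_arn" \
    --query 'eventSummaries[].[eventTime,eventType]' \
    --output text
}

resolve_incident() {
  local incident_arn="$1"
  # Equivalent in effect to choosing Resolve incident in the console.
  aws ssm-incidents update-incident-record \
    --arn "$incident_arn" \
    --status RESOLVED
}
```

As in the console, resolve the incident only after you have completed the runbook steps and confirmed that the incident is mitigated.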

Post-incident analysis

It helps to see every incident as an opportunity to learn and improve your processes. Your team becomes stronger by thinking through the issue, identifying its root cause, assessing how it could have been prevented, and assigning clear follow-up action items. Root cause analysis requires you to ask the right questions and carefully note any assumptions. If an incident involves multiple teams, it’s a good practice to use post-incident analysis to coordinate among the teams rather than allowing each (decentralized) team to focus on their own services.

Incident Manager provides a feature you can use to document the post-incident analysis. When you resolve the incident, a banner shown in Figure 8 displays a Create analysis button.

Create analysis displays a Title field (in this example, cpu-incident-1) and a Template dropdown (in this example, AWSIncidents-PostIncidentAnalysisTemplate).

Figure 9: Create analysis

For Template, choose AWSIncidents-PostIncidentAnalysisTemplate, and then choose Create.

You will be prompted to answer a list of questions related to the detection, diagnosis, and mitigation of this incident. This step helps teams reflect on what went well and what could be improved in incident prevention and mitigation.

The post-incident analysis for cpu-incident-1 displays the Region (us-east-1), AWS accounts, and Status (Completed). It also includes Summary and Impact sections.

Figure 10: Post-incident analysis for cpu-incident-1

Incident Manager integrates with AWS Systems Manager OpsCenter to help you track incidents and post-incident action items from a central place. It also synchronizes with popular third-party incident management tools such as Jira Service Desk and ServiceNow.

The post-incident analysis for cpu-incident-1 includes two action items: Update detection alarm threshold and Update incident runbook. Both have a priority of High.

Figure 11: Action items

Cleaning up

To avoid ongoing charges in your AWS account, clean up the resources you created in this walkthrough. To delete the CloudWatch alarm, select the alarm you created, choose Actions, and then choose Delete. To terminate the EC2 instance, select the instance, choose Instance state, and then choose Terminate instance.
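
The two cleanup actions above can be sketched with the AWS CLI as follows. The helper name cleanup_walkthrough is hypothetical, and the alarm name and instance ID are whatever you used in this walkthrough. Terminating an instance cannot be undone, so check the instance ID before you run it.

```shell
# Sketch: delete the CloudWatch alarm and terminate the EC2 instance.
# cleanup_walkthrough is a hypothetical helper name.
cleanup_walkthrough() {
  local alarm_name="$1"
  local instance_id="$2"
  # Stop here if the alarm cannot be deleted.
  aws cloudwatch delete-alarms --alarm-names "$alarm_name" || return 1
  aws ec2 terminate-instances --instance-ids "$instance_id"
}
```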

To delete the resources created in part 1 of this series, including the contacts, response plan, and escalation plan, refer to the Incident Manager User Guide.

Conclusion

In this blog post, we showed you how to use Incident Manager with Amazon CloudWatch to prepare for and mitigate incidents. Post-incident analysis is key to successful incident management. Incident Manager allows you to perform this analysis by providing templates and mechanisms to help your team reflect on the wins and growth areas after incidents occur. For more information about Incident Manager, see the AWS Systems Manager User Guide.

About the authors

Harshitha Putta

Harshitha Putta is a Senior Cloud Infrastructure Architect with AWS Professional Services in Seattle, WA. She is passionate about building innovative solutions using AWS services to help customers achieve their business objectives. She enjoys spending time with family and friends, playing board games and hiking.

Guyu Ye

Guyu Ye is a Cloud Architect at AWS based in Austin, TX. She enjoys helping customers simplify complex problems through technology. During her free time, she likes spending time with friends and family, hiking with her adorable pup Albert, teaching/taking yoga classes, and working on random DIY projects.