Many of our customers need an effective incident management and response solution to achieve operational excellence and performance efficiency. Transparency between those who are affected by the incident and those who respond to the incident is key to any incident management process. Finding the right team to mitigate the impact of application or workload incidents can often take hours. On top of that, the incident response process is usually manual and involves lots of uncertainties. This is not desirable, especially when it comes to critical applications that can impact revenue and reputation.

Customers often ask us how we manage incidents internally. To simplify incident response management, we have just released a new AWS Systems Manager capability, Incident Manager, that incorporates the best practices we follow for internal incident management at Amazon. When you use Incident Manager, you engage the right responders at the right time, track incident updates, automate remediation actions, and enable chat-based collaboration.

This is the first in a two-part series. In this post, we discuss prerequisites, onboarding, and setting up incident management components. In the second post, AWS Systems Manager Incident Manager integration with Amazon CloudWatch, we discuss how Incident Manager integrates with Amazon CloudWatch, how Incident Manager components manage an incident, and the importance of post-incident analysis.

What is an incident?

An incident is an issue that occurs in your AWS-hosted applications. Consider an application running on an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance that uses Amazon CloudWatch for monitoring. Suppose the application experiences a large spike in traffic and CPU utilization rises above 85%. The application is not prepared to handle the higher load, so its performance degrades. In situations like this, you need a response plan that engages the right responders to inspect and mitigate the incident.
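To make the scenario concrete, the following minimal sketch builds the request for a CloudWatch alarm that fires when average CPU utilization exceeds 85%. The parameter names follow the CloudWatch PutMetricAlarm API. The instance ID, alarm name, and evaluation settings are placeholders we assume for illustration, not values from this walkthrough.

```python
def build_cpu_alarm_request(instance_id: str, threshold_percent: float = 85.0) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,            # seconds; detailed monitoring gives 1-minute data
        "EvaluationPeriods": 3,  # assumption: three breaching minutes in a row
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Placeholder instance ID, not a real resource.
alarm_request = build_cpu_alarm_request("i-0123456789abcdef0")

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```

The sketch only builds the request, so you can review the values before sending anything to your account.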

To mitigate the incident’s impact, you must be able to identify the right contacts. In scenarios where an immediate response is required and the first contact cannot be reached, you need an escalation plan that allows you to escalate incidents to another set of contacts. After the right contact is engaged to manage the incident, you need a runbook that includes detailed steps to mitigate the incident. Especially during critical incidents, responders are under a lot of stress and prone to making mistakes. A runbook is handy because it instructs responders on how to inspect and manage the incident. The combination of contacts, escalation path, and a runbook of instructions is called a response plan.

Prerequisites

To complete the steps in this walkthrough, you need the following:

  • An AWS account and AWS Identity and Access Management (IAM) permissions to access Amazon EC2, AWS Systems Manager, and Amazon CloudWatch. Your IAM user or role should also have the iam:CreateServiceLinkedRole permission. Incident Manager uses this permission to create the AWSServiceRoleForIncidentManager service-linked role in your account. For more information, see Using service-linked roles for Incident Manager.
  • An IAM role that allows Incident Manager to start an AWS Systems Manager automation runbook. For more information, see Setting Up Automation.
  • An EC2 Linux instance with detailed monitoring enabled. For this walkthrough, t2.micro is sufficient. The instance should not currently be attached to an Auto Scaling group, because the manual runbook you create later in this post instructs the incident responder to attach the instance to an Auto Scaling group. For information about how to connect to your EC2 instance, see Connect to your Linux instance in the Amazon EC2 User Guide for Linux Instances.
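As a rough illustration of the second prerequisite, the following sketch builds the two policy documents such a role typically needs: a trust policy that lets the Incident Manager service assume the role, and a permissions policy that allows it to start an automation. The service principal and resource pattern reflect our understanding of the Incident Manager documentation. The account ID and the helper function names are placeholders we introduce here, so verify the policies against Setting Up Automation before you use them.

```python
import json


def build_trust_policy() -> dict:
    """Trust policy that lets Incident Manager assume the runbook role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm-incidents.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_runbook_policy(account_id: str, document_name: str) -> dict:
    """Permissions policy that allows starting one automation runbook."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ssm:StartAutomationExecution",
                "Resource": (
                    f"arn:aws:ssm:*:{account_id}:"
                    f"automation-definition/{document_name}:*"
                ),
            }
        ],
    }


# 111122223333 is a placeholder account ID.
trust_policy = build_trust_policy()
runbook_policy = build_runbook_policy("111122223333", "ec2-cpu-util-runbook")

# JSON text suitable for pasting into the IAM console or passing to IAM APIs.
trust_policy_json = json.dumps(trust_policy, indent=2)
runbook_policy_json = json.dumps(runbook_policy, indent=2)
```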

Complete onboarding steps

If this is your first time using Incident Manager, you need to complete the onboarding steps. If you have already completed these steps, skip ahead to the “Add contacts” section of the walkthrough.

  1. Sign in to the AWS Management Console and search for Incident Manager.
  2. On the Incident Manager page, choose Prepare.

Incident Manager page displays a How it works section, documentation links, and a Prepare button.

Figure 1: Incident Manager page

  3. The onboarding wizard displays the workflow for setting up your incident response plan. Under General settings, choose Set up.

Under How it works, there are steps for configuring general settings, defining contacts and contact channels, creating an escalation plan, and combining contacts, escalation plans, runbooks, automation, and metrics into a response plan.

Figure 2: Incident response plan workflow

  4. On Terms and Conditions, read the text carefully, select the checkbox, and then choose Next.

The checkbox on Terms and Conditions reads “I have read and agree to the AWS Incident Manager terms and conditions.”

Figure 3: Terms and Conditions

  5. The next step is to create a replication set. The replication set uses an AWS Key Management Service (AWS KMS) key to encrypt your data and replicate it across multiple AWS Regions. You can use an AWS-owned key or your own customer-managed key. In this walkthrough, choose one AWS Region for the replication set. You can add more Regions to the replication set later. For more information, see Using the Incident Manager replication set.

Under Region, "us-east-1" is selected. Under KMS Encryption, select "Use AWS owned key".

Figure 4: Replication set creation

After you create the replication set, you can start adding contacts.
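If you prefer to script onboarding, the replication set from step 5 can be sketched as a request to the Incident Manager CreateReplicationSet API. This is a minimal sketch: the helper name is ours, and using "DefaultKey" to select the AWS-owned key is our understanding of the API, so confirm it in the API reference.

```python
def build_replication_set_request(regions, kms_key_id="DefaultKey"):
    """Build keyword arguments for ssm-incidents CreateReplicationSet.

    Assumption: "DefaultKey" selects the AWS-owned key. Pass a key ARN
    instead to use your own customer-managed key.
    """
    if not regions:
        raise ValueError("a replication set needs at least one Region")
    return {"regions": {region: {"sseKmsKeyId": kms_key_id} for region in regions}}


# One Region, as in this walkthrough.
replication_request = build_replication_set_request(["us-east-1"])

# With boto3 installed and credentials configured:
#   boto3.client("ssm-incidents").create_replication_set(**replication_request)
```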

Add contacts

Your contacts should include everyone who might be involved in the incident. Follow these steps to add a contact.

  1. In the AWS Systems Manager console, expand Operations Management, and then expand Incident Manager.
  2. Choose Contacts, and then choose Create contact.

Contacts page displays View details, Edit, Delete, Test, and Create contact buttons. It also includes a search field for finding contacts.

Figure 5: Create contact

  3. On Contact information, enter names and define contact channels for your contacts. Contact channels are the methods you’ll use to engage your contact when an incident occurs.
  4. Under Contact channel, you can choose from email, SMS, and voice. You can also add multiple contact channels. In a real-world scenario, use the responder’s actual name as the contact name rather than a generic term such as incident responder.

In the Name field, incident responder is displayed. Under Contact channel, the email address and mobile phone for the incident responder are displayed.

Figure 6: Contact information

  5. In Engagement plan, you can specify how quickly to engage your responders. In Figure 7, the incident responder is engaged through email 0 minutes into an incident and through SMS 10 minutes into the incident. Complete the fields and then choose Create.

Under Contact channel name, incident responder email and incident responder mobile are selected from the dropdown lists. Under Engagement time (min), a value of 0 is selected for incident responder email. A value of 10 is selected for incident responder mobile.

Figure 7: Engagement plan

The contact you added will receive a six-digit activation code to confirm the contact channels. After the contact is activated, the contact can be part of an escalation plan.

On Contact channel activation, there is an Activation code field where the incident responder can enter a six-digit code.

Figure 8: Contact channel activation

  6. (Optional) Repeat steps 1-5 to create another contact. In this walkthrough, we created another contact with the alias sr-incident-responder. In a real-world scenario, the contact details should be those of an actual person or team.
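The engagement plan from Figure 7 can also be expressed through the ssm-contacts API, where a plan is a list of stages and each stage waits DurationInMinutes before the next one starts. The following sketch converts "engage at minute N" settings into such stages. The mapping is our interpretation of the API, and the channel ARNs and function name are placeholders.

```python
from collections import defaultdict


def build_engagement_plan(engagement_minutes: dict) -> dict:
    """Convert {channel_arn: minute_to_engage} into an ssm-contacts Plan.

    Each stage engages the channels due at that minute, then waits until
    the next engagement time. The final stage has nothing left to wait for.
    """
    by_minute = defaultdict(list)
    for channel_arn, minute in engagement_minutes.items():
        by_minute[minute].append(channel_arn)

    minutes = sorted(by_minute)
    if not minutes or minutes[0] != 0:
        raise ValueError("this sketch expects at least one channel at minute 0")

    stages = []
    for index, minute in enumerate(minutes):
        is_last = index == len(minutes) - 1
        wait = 0 if is_last else minutes[index + 1] - minute
        stages.append(
            {
                "DurationInMinutes": wait,
                "Targets": [
                    {"ChannelTargetInfo": {"ContactChannelId": arn}}
                    for arn in by_minute[minute]
                ],
            }
        )
    return {"Stages": stages}


# Placeholder ARNs; real ones are returned when you create the channels.
EMAIL_CHANNEL = "arn:aws:ssm-contacts:us-east-1:111122223333:contact-channel/incident-responder/email"
SMS_CHANNEL = "arn:aws:ssm-contacts:us-east-1:111122223333:contact-channel/incident-responder/sms"

# Email at minute 0 and SMS at minute 10, as in Figure 7.
plan = build_engagement_plan({EMAIL_CHANNEL: 0, SMS_CHANNEL: 10})

# With boto3, the plan could be applied to an existing contact:
#   boto3.client("ssm-contacts").update_contact(ContactId=contact_arn, Plan=plan)
```

In this sketch, the email stage waits 10 minutes before the SMS stage begins, which mirrors the engagement times shown in the console.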

Create an escalation plan

Follow these steps to set up an escalation plan.

  1. In the AWS Systems Manager console, expand Operations Management, and then expand Incident Manager.
  2. Choose Escalation plans, and then choose Create escalation plan.

Escalation plan displays View details, Edit, Delete, Test, and Create escalation plan buttons. It also includes a search field you can use to find escalation plans.

Figure 9: Escalation plan

  3. On Create escalation plan, you can specify the number of stages you want to include in the escalation plan. Under Stage 1, for Stage duration, enter 5. Choose the first contact you created as the primary contact. Select or clear the checkbox next to Contact name to indicate whether to stop escalation after your contact in Stage 1 has acknowledged the incident.

On Create escalation plan, for Contact name, incident-responder is selected from the dropdown. The Contact acknowledgement stops plan progression checkbox is selected.

Figure 10: Contact acknowledgement stops plan progression checkbox

  4. If you created another contact, you can add that contact in Stage 2. Because you set the Stage duration to 5 in Stage 1, if the responder in Stage 1 does not acknowledge the incident five minutes into the engagement, the next set of contacts, sr-incident-responder, will be engaged.

Under Stage 2, for Contact name, sr-incident-responder is displayed. The Contact acknowledgement stops plan progression checkbox is selected.

Figure 11: Stage 2
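In the ssm-contacts API, an escalation plan is a contact of type ESCALATION whose stages target other contacts. The following sketch shows how the two stages above might look as a CreateContact request. The contact ARNs and helper name are placeholders, and the Stage 2 duration of 0 is our assumption because the walkthrough does not specify it.

```python
def build_escalation_plan_request(alias: str, stages: list) -> dict:
    """Build keyword arguments for ssm-contacts CreateContact.

    `stages` is a list of (contact_arns, duration_minutes, ack_stops_plan)
    tuples. IsEssential corresponds to the console checkbox
    "Contact acknowledgement stops plan progression".
    """
    return {
        "Alias": alias,
        "Type": "ESCALATION",
        "Plan": {
            "Stages": [
                {
                    "DurationInMinutes": duration,
                    "Targets": [
                        {"ContactTargetInfo": {"ContactId": arn, "IsEssential": ack}}
                        for arn in contact_arns
                    ],
                }
                for contact_arns, duration, ack in stages
            ]
        },
    }


# Placeholder contact ARNs for the two contacts created earlier.
RESPONDER = "arn:aws:ssm-contacts:us-east-1:111122223333:contact/incident-responder"
SR_RESPONDER = "arn:aws:ssm-contacts:us-east-1:111122223333:contact/sr-incident-responder"

escalation_request = build_escalation_plan_request(
    "app-escalation-plan",
    [
        ([RESPONDER], 5, True),     # Stage 1: wait 5 minutes for acknowledgement
        ([SR_RESPONDER], 0, True),  # Stage 2: duration of 0 is an assumption
    ],
)

# With boto3 installed and credentials configured:
#   boto3.client("ssm-contacts").create_contact(**escalation_request)
```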

Create a response plan

Now you’re ready to create a response plan for the incident, which ties together the contacts, escalation plan, and runbook. When an incident occurs, a response plan defines who to engage, how to engage, which runbook to initiate, and which metrics to monitor. By creating a well-defined response plan, you can save teams time down the road.

  1. In the Systems Manager console, expand Operations Management, and then expand Incident Manager.
  2. Choose Response plans, and then choose Create response plan.

Response plans displays View details, Edit, Delete, and Create response plan buttons. It also includes a table (in this example, empty) with columns for name, chat channel, engagements, and runbook.

Figure 12: Response plans

  3. In Create response plan, for Name, enter EC2HighCPU. For Title, enter cpu-incident-1. For Impact, choose Medium.

Create response plan shows fields completed with values used in the procedure.

Figure 13: Create response plan

You can optionally configure an AWS Chatbot channel as part of the response plan so that incident responders can communicate over chat. For more information, see Setting up AWS Chatbot in the AWS Chatbot Administrator Guide.

Chat channel section displays fields where you can select the chat channel and SNS topics.

Figure 14: Chat channel

  4. Under Engagements, choose the escalation plan you created earlier.

On Engagements, the app-escalation-plan is displayed.

Figure 15: Engagements

  5. In Runbook, choose Clone runbook from template, and for Runbook name, enter ec2-cpu-util-runbook.
  6. Under Execution permissions, choose Create an IAM role using a template. Under Role name, select the IAM role you created in the Prerequisites section that allows Incident Manager to run SSM automation documents, and then choose Create response plan.

Runbook displays options and role name as described in the procedure.

Figure 16: Runbook

  7. You can now view the response plan you created. Under Document, choose the runbook you cloned from the default template (ec2-cpu-util-runbook). You will edit the runbook to make it more applicable to the incident.

Runbook displays fields for document (in this example, ec2-cpu-util-runbook), role name (ssm-automation), execution target, and document version.

Figure 17: ec2-cpu-util-runbook

  8. In the SSM Automation document cloned from the default runbook, there are generic steps to mitigate an incident. Under Actions, choose Create new version.

Under Document description, there are fields for platform (Windows, Linux, macOS), created, owner, target type, and status.

Figure 18: Document description

  9. On Create new version, choose the Editor tab, and then choose Edit. When you see a message that says you can’t return to the Builder tab after you choose the Editor tab, choose OK.

Create new version includes a section for document details (name, document type, and default version) and Builder and Editor tabs.

Figure 19: Editor tab

Copy and paste the following code into the Document editor field, and then choose Create automation. This code snippet creates a manual runbook that instructs the incident responder to attach the instance to an Auto Scaling group.

description: |-
  This runbook walks you through how to resolve high CPU utilization of your EC2 instance by adding your instance to an Auto Scaling group.
schemaVersion: '0.3'
mainSteps:
  - name: Inspect
    action: 'aws:pause'
    inputs: {}
    description: |-
      Navigate to the **Amazon CloudWatch** console, determine the instance in question. Copy the source code of the metrics for CPU utilization.
  - name: AddMetrics
    action: 'aws:pause'
    inputs: {}
    description: |-
      Navigate back to **Incident Manager**, click the incident you are engaged to resolve. Under the **Metrics** tab, click **Add**. Select **From CloudWatch metrics**, and paste the metric source. Now you can start monitoring the CPU utilization throughout the incident.
  - name: AttachToAutoScalingGroup
    action: 'aws:pause'
    inputs: {}
    description: |-
      1. Open the **Amazon EC2** console, navigate to **Instances**. Select the instance in question.
      2. Choose **Actions**, then **Instance settings**. Click **Attach to Auto Scaling Group**.
      3. On the *Attach to Auto Scaling group* page, for *Auto Scaling Group*, enter a name for the group, then choose **Attach**. The new Auto Scaling group is created using a new launch configuration with the same name that you specified for the Auto Scaling group. The launch configuration gets its settings from the instance that you attached.
      4. On the left pane of the Amazon EC2 console, under **AUTO SCALING**, choose **Auto Scaling Groups**. Select the checkbox next to the new Auto Scaling group you have just created, and choose the **Edit** button. Change the setting to Max size = 3. Choose **Update**.
      5. Under Automatic scaling, click **Add policy**. Choose *Target tracking scaling* and set the *Metric type* to *Average CPU utilization* of 50. Click **Create**.
  - name: Validate
    action: 'aws:pause'
    inputs: {}
    description: |-
      Navigate back to the incident under **Incident Manager**. Check out the **Metrics** tab. Observe the changes and validate that the incident has been resolved.
  - name: CloseIncident
    action: 'aws:pause'
    inputs: {}
    description: |-
      After you validate that the incident has been resolved, you can close the incident.
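The runbook keeps every step manual, but instructions 4 and 5 of the AttachToAutoScalingGroup step correspond to two Amazon EC2 Auto Scaling API calls, UpdateAutoScalingGroup and PutScalingPolicy. The following sketch builds those requests in case you want to script them later. The group name, policy name, and function name are placeholders we chose, and attaching the instance (instructions 1 through 3) is left out of the sketch.

```python
def build_scaling_requests(group_name: str, max_size: int = 3,
                           target_cpu_percent: float = 50.0):
    """Build requests for UpdateAutoScalingGroup and PutScalingPolicy."""
    update_group = {
        "AutoScalingGroupName": group_name,
        "MaxSize": max_size,  # instruction 4: Max size = 3
    }
    scaling_policy = {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            # Predefined metric for the group's average CPU utilization.
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu_percent,  # instruction 5: target of 50
        },
    }
    return update_group, scaling_policy


# "ec2-cpu-asg" is a placeholder group name.
update_group_request, scaling_policy_request = build_scaling_requests("ec2-cpu-asg")

# With boto3 installed and credentials configured:
#   autoscaling = boto3.client("autoscaling")
#   autoscaling.update_auto_scaling_group(**update_group_request)
#   autoscaling.put_scaling_policy(**scaling_policy_request)
```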

You have now created a response plan that engages your incident responders and provides a predefined runbook to help them mitigate the incident.
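For reference, the response plan you assembled in the console can be sketched as a single request to the Incident Manager CreateResponsePlan API, where impact is a number from 1 (Critical) to 5 (No Impact). The names and title match this walkthrough. The ARNs use a placeholder account ID, and the helper function is our own illustration rather than part of any AWS SDK.

```python
# Console impact labels mapped to the numeric codes the API expects.
IMPACT_LEVELS = {"Critical": 1, "High": 2, "Medium": 3, "Low": 4, "No Impact": 5}


def build_response_plan_request(name, title, impact, escalation_plan_arn,
                                runbook_name, role_arn):
    """Build keyword arguments for ssm-incidents CreateResponsePlan."""
    return {
        "name": name,
        "incidentTemplate": {"title": title, "impact": IMPACT_LEVELS[impact]},
        # Who to engage: contacts or escalation plans, referenced by ARN.
        "engagements": [escalation_plan_arn],
        # Which runbook to start, and the role Incident Manager assumes to run it.
        "actions": [
            {"ssmAutomation": {"documentName": runbook_name, "roleArn": role_arn}}
        ],
    }


response_plan_request = build_response_plan_request(
    name="EC2HighCPU",
    title="cpu-incident-1",
    impact="Medium",
    escalation_plan_arn="arn:aws:ssm-contacts:us-east-1:111122223333:contact/app-escalation-plan",
    runbook_name="ec2-cpu-util-runbook",
    role_arn="arn:aws:iam::111122223333:role/ssm-automation",
)

# With boto3 installed and credentials configured:
#   boto3.client("ssm-incidents").create_response_plan(**response_plan_request)
```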

In the second blog post of this series, we discuss how the components you created can be used for managing an incident.

Conclusion

In this blog post, we showed how you can use Incident Manager to prepare for and mitigate incidents. Preparation is key to successful incident management. Incident Manager helps you create escalation plans and customized runbooks. It offers more capabilities, too, such as the ability to engage your responders through Slack and the use of automated runbooks. For more information, see Incident Manager in the AWS Systems Manager User Guide.

About the authors

Harshitha Putta

Harshitha Putta is a Senior Cloud Infrastructure Architect with AWS Professional Services in Seattle, WA. She is passionate about building innovative solutions using AWS services to help customers achieve their business objectives. She enjoys spending time with family and friends, playing board games and hiking.

Guyu Ye

Guyu Ye is a Cloud Architect at AWS based in Austin, TX. She enjoys helping customers simplify complex problems through technology. During her free time, she likes spending time with friends and family, hiking with her adorable pup Albert, teaching/taking yoga classes, and working on random DIY projects.