Bottlerocket is an open source Linux-based operating system from Amazon that was purpose built for running containers with a strong emphasis on security. The result is an operating system that comes with a variety of built-in controls for creating a secure environment for running containerized workloads. In this post, we’ll explore several of the security features available in Bottlerocket and how they protect your environment.

Introduction

When you run a container using a container orchestration service such as Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS), there are a variety of things a security practitioner must consider. Securing the software supply chain, consistently enforcing policies across different clusters, and securing the infrastructure are a few important examples. At Amazon Web Services (AWS), we recommend a layered approach to security, where implementing protections at different levels of the stack is paramount. This layered approach includes safeguarding the host—the computer on which your container runs.

If you’re new to containers, you might wonder why hardening the host, in addition to all the other layers, is so important. Containers are processes that run on a shared operating system kernel. Isolation between containers is achieved using Linux namespaces, a kernel feature that partitions kernel resources at the operating system level. Docker containers use Linux kernel namespaces to restrict any user, including root, from directly accessing the machine’s resources. Yet namespaces, by themselves, are not considered a strong security boundary. They are simply a way to make a global resource appear to be unique and isolated. Runtimes such as Docker and containerd also run containers as root by default, which could allow escalation of privileges if other compensating controls aren’t in place, such as forcing applications to run as a non-root user.
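
To make the namespace idea concrete, here is a minimal Python sketch that lists the namespaces a process belongs to by reading the symbolic links under /proc/<pid>/ns, a standard Linux interface. The function name list_namespaces is our own and used only for illustration. Two processes that report the same identifier for a namespace type share that namespace.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_type: identifier} for a process.

    Reads the symlinks under /proc/<pid>/ns. Returns an empty dict when the
    directory is unavailable (for example, on a non-Linux system).
    """
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for name in entries:
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Skip entries we are not permitted to read.
            continue
    return namespaces


if __name__ == "__main__":
    for ns_type, ident in list_namespaces().items():
        print(f"{ns_type:20} {ident}")
```

Running this once on the host and once inside a container should show different identifiers for the namespace types the runtime has isolated.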

Having strong isolation or mitigating controls is important in multi-tenant environments in which the infrastructure for running containers may be shared among different customers. Code running in environments like this should be treated as if it is untrustworthy. For example, a skilled attacker could exploit a configuration issue with a container, applications, or a cluster to gain access to the underlying host. Once on the host, they could potentially gain access to resources beyond the container’s scope.

Bottlerocket security overview

In this overview, we start with an examination of Bottlerocket’s surface area, or the attack vectors an unauthorized party could potentially use to gain access to the operating system. We describe Bottlerocket’s image-based updates powered by The Update Framework (TUF). We review kernel lockdown and how that feature can prevent unsigned modules from being added to the operating system. We also examine the impact of Bottlerocket’s read-only root file system and its use of dm-verity for verifying file system integrity. And we dive into SELinux and how Bottlerocket’s built-in policies can constrain the behavior of containers running on the host. Examples throughout this post help illustrate how SELinux can thwart different types of attacks. Finally, we examine Bottlerocket’s use of ephemeral storage and how configuration changes are persisted across reboots. Along the way, we provide best practices for securing an environment even further.

Reduced attack surface

The first way Bottlerocket improves a security posture is by removing all shells, interpreters, and package managers from the Bottlerocket image. When running Bottlerocket, instead of installing packages on the host, you run additional software in containers.

Bottlerocket provides the control and admin containers by default. These containers are based on Amazon Linux 2 and have access to the packages available in Amazon’s Yum repositories, but you can also build container images with your preferred operating system and package manager. These images can be run using your preferred orchestrator, such as Kubernetes or Amazon ECS, or with Bottlerocket’s host container feature.

Bottlerocket’s API-first, container-centric approach also helps simplify fleet management. For example, Bottlerocket integrates with AWS Systems Manager, which is a collection of services that you can use to view and control your infrastructure on AWS, including Bottlerocket instances.

The preferred option when running Bottlerocket in the AWS Cloud is to use Bottlerocket’s control container, which runs outside of the orchestrator in a separate instance of containerd and is accessible via AWS Systems Manager Session Manager. With AWS Systems Manager Session Manager, access to the operating system can be regulated with AWS Identity and Access Management (IAM). The control container runs with the control_t SELinux label and mounts the Linux socket for the Bottlerocket API. Together, they provide the ability to configure the Bottlerocket operating system or enable the admin container when it is warranted.

What if you want to customize the Bottlerocket image?

There is a mechanism to create your own variant of Bottlerocket with software or customizations; however, there are a couple of caveats.

First, custom images will no longer trust the official Bottlerocket images. If you want to switch to the official builds of Bottlerocket in the future, you must replace all of your nodes with an AWS variant.

Second, if you want to support in-place upgrades, you must decide where to store the signing keys and how your repository contents will be delivered.

To learn more about publishing your own images, review the Publishing Bottlerocket documentation.

Bottlerocket is meant to be versatile enough to fit your use cases so you don’t have to build and maintain a variant. If you do need to run additional packages, consider running them in a container. Similarly, if you need to customize the operating system, try using a bootstrap container before building a variant.

Image-based updates

Upgrading an instance running Bottlerocket is similar to a firmware upgrade on a physical device or applying an operating system update to your phone. You don’t need to install a bunch of software packages to get the new version. Instead, download a full system disk image and apply the update, which is written to an alternate, inactive partition. To activate the upgrade, reboot the instance, which now boots from the new primary partition with a new version of the operating system. All the data and persistent configuration is stored on a separate data partition and is available after the reboot.
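
The two-partition update flow can be pictured with a small toy model in Python. This is only a sketch of the idea described here, not Bottlerocket’s implementation: the class name, slot names, and version strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionSet:
    """Toy model of two OS image slots; names and versions are made up."""
    images: dict = field(default_factory=lambda: {"A": "1.0.0", "B": None})
    active: str = "A"
    pending: bool = False  # True once an update is staged for the next boot

    @property
    def inactive(self):
        return "B" if self.active == "A" else "A"

    def stage_update(self, version):
        # The new image is written to the inactive slot; the running
        # system is left untouched.
        self.images[self.inactive] = version
        self.pending = True

    def reboot(self):
        # On reboot, a staged slot becomes the active one.
        if self.pending:
            self.active = self.inactive
            self.pending = False
        return self.images[self.active]
```

Staging an update leaves the running version unchanged until reboot, and the previous image remains on the other slot afterward.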

With Bottlerocket, we’re committed to providing timely image updates as patches are released. These updates can be applied by running apiclient update apply --check --reboot from the admin or control containers or by using the Update Operator for Kubernetes, or ECS Updater for Amazon ECS. The operator and updater solutions are similar in that they both drain containers from instances before performing the update. The updates occur in waves, rather than all at once, to reduce the impact of issues that might occur during the upgrade. For additional information about waves, refer to the update waves documentation.

The Bottlerocket image is vended through an update repository protected with The Update Framework (TUF). TUF also provides the mechanism Bottlerocket uses for doing secure, in-place upgrades of the operating system.

Embedded within each Bottlerocket image is a root.json file that begins the chain of trust by listing the keys that Bottlerocket will trust. This file is signed by multiple keys to hinder an attacker’s ability to replace it with a different version. The TUF repository also includes a signed targets.json file, which lists all the available target files in the repository and their hashes. Any file listed in this manifest is considered a TUF target and can only be downloaded from the TUF repository, thereby preventing Bottlerocket from downloading untrusted data, including untrusted images. This helps ensure the ongoing integrity of the software supply chain for Bottlerocket.
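
To illustrate the role of the hashes in targets.json, the following Python sketch checks downloaded bytes against the length and SHA-256 digest listed for a target. It is a simplified illustration rather than a TUF client: the function name and sample file name are hypothetical, the dictionary only mirrors the general shape of a TUF targets entry, and verifying the signatures on the metadata itself is deliberately left out.

```python
import hashlib
import hmac


def verify_target(name, data, targets):
    """Check downloaded bytes against the length and SHA-256 listed for `name`.

    `targets` is a simplified stand-in for the targets section of signed
    metadata: {name: {"length": int, "hashes": {"sha256": hex_digest}}}.
    """
    entry = targets.get(name)
    if entry is None:
        return False  # not listed in the manifest: do not trust it
    if len(data) != entry["length"]:
        return False
    digest = hashlib.sha256(data).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(digest, entry["hashes"]["sha256"])
```

A target that is missing from the manifest, has the wrong length, or has a different digest is rejected.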

Managed node groups

The Bottlerocket update operator provides a good mechanism for upgrading Bottlerocket instances if you are using self-managed Kubernetes or Amazon EKS self-managed nodes; however, if you’d like a fully managed experience, try deploying Bottlerocket with EKS-managed node groups. Unlike the update operator, a managed node group can be instantiated with AWS CloudFormation or the Amazon EKS API. Managed node groups replace the instances in your node group with a newer or different version of the Bottlerocket AMI.

Rather than performing the upgrade in waves, managed node groups gradually replace the instances in the managed node group according to its max concurrency setting. Additionally, with a managed node group, you are notified when an update is available in the console and you can trigger the upgrade to occur at a precise time.

Replacing instances after a specific “time to live” has lapsed may be another reason to consider using managed node groups with Bottlerocket. By recycling hosts periodically, you can disrupt an attacker who has managed to compromise the host’s kernel. We examine how to mitigate the risk of a kernel security issue in the next section.

Kernel lockdown

Suppose that an attacker was able to escape from a container and access the underlying host as root. How can Bottlerocket lessen the severity of a security incident?

The first way is to configure kernel lockdown in integrity mode. This limits an attacker’s ability to overwrite the kernel’s memory or modify its code. It also can prevent an attacker from loading unsigned kernel modules. Only kernel modules included in the Bottlerocket image can be loaded.

On earlier versions of Bottlerocket variants, kernel lockdown is set to none, which disables the protection; however, on newer versions such as aws-ecs-1, aws-k8s >= 1.20, aws-dev, and VMware variants, kernel lockdown is set to integrity mode.

Alternatively, you can set kernel lockdown to confidentiality mode. This limits the ways you can read the kernel’s memory from user space. The primary purpose of this mode is to protect secrets that are stored in the kernel. Given the impact confidentiality mode has on eBPF, perf, and other tools that rely on reading kernel memory, we generally recommend using integrity mode with Bottlerocket.

To set kernel lockdown to integrity mode using the TOML file referenced in the instance’s user-data:

[settings.kernel]
lockdown = "integrity"

Alternatively, you can run apiclient set kernel.lockdown=integrity from the control container.

Note: Changing the lockdown setting may require a system reboot before taking effect.
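
One way to confirm which mode is active is to read /sys/kernel/security/lockdown, the standard Linux interface that lists the available modes and marks the current one in square brackets. The Python sketch below parses that format. It assumes securityfs is mounted and readable wherever the script runs (for example, in a suitably privileged container), and the function names are our own.

```python
def parse_lockdown(text):
    """Return the active mode from text such as 'none [integrity] confidentiality'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_lockdown(path="/sys/kernel/security/lockdown"):
    """Read the active lockdown mode, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return parse_lockdown(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_lockdown())
```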

Read-only root file system and dm-verity

Besides trying to add kernel modules, an attacker might try replacing files on the root file system with malicious versions. Once again, Bottlerocket can offer additional layers of protection by preventing changes to files on the root filesystem by mounting rootfs as a read-only volume.

Bottlerocket also makes use of dm-verity, a Linux kernel module that provides transparent integrity checking of block devices using a cryptographic digest. dm-verity maintains file system integrity by catching accidental or deliberate changes to the data and failing closed (by panicking) when such changes occur. Each time the system boots, dm-verity verifies the integrity of the root file system and refuses to boot the operating system when there’s an error or evidence of corruption. This helps protect against some container escape vulnerabilities, such as CVE-2019-5736.
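
The idea behind this kind of integrity checking can be sketched in a few lines of Python: hash every block, fold the hashes into a single root hash, and compare that root against a trusted value. This is a conceptual illustration only. The block size, the pairwise tree layout, and the function names are assumptions made for the example, and the real dm-verity on-disk format (salts, hash block layout, verification as blocks are read) differs.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the device contents."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def root_hash(hashes):
    """Fold a list of hashes pairwise into a single root hash."""
    if not hashes:
        return hashlib.sha256(b"").digest()
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def verify(data, trusted_root):
    """Recompute the root hash and compare it with the trusted value."""
    return root_hash(block_hashes(data)) == trusted_root
```

Because the root hash depends on every block, changing a single byte anywhere on the device produces a different root and the check fails.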

Example

To view this in action by overwriting the dm-verity hash, start by logging into the admin container. Observe that the host’s root file system is mounted at /.bottlerocket/rootfs and that it is mounted as read-only.

Note: This will render the node inoperable. Do not attempt on a node that you need to retain.

$ findmnt
TARGET SOURCE FSTYPE OPTIONS
├─/.bottlerocket/rootfs /dev/dm-0 ext4 ro,relatime,seclabel

Next, run sheltie within the admin container and then run the dd command as shown in the following. dm-verity will identify hash corruption and immediately reboot the node.

bash-5.0# dd if=/dev/zero of=<hash partition> bs=1M count=1
bash-5.0# echo 1 > /proc/sys/vm/drop_caches

Running blkid finds the hash partition. The partition will be of type DM_verity_hash.

The node may reboot multiple times before it is marked as unhealthy and replaced by the Auto Scaling Group.

SELinux

Linux typically restricts access to objects using discretionary access controls (DAC), where access is governed by the identity of the user. For example, the permission mask set on a resource is controlled by the owner of that resource.

SELinux is a Linux security module that uses a rules-based system that specifies the actions that are allowed to be performed against resources. These rules override the permissions granted by DAC. If a rule is not specified for a particular action, the action is automatically denied. This is known as mandatory access control (MAC).

With SELinux, administrators use a labeling scheme to identify different system resources—for example, processes, sockets, and files. These labels are referenced in the rules SELinux enforces.
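
The default-deny behavior of mandatory access control can be modeled with a toy rule table in Python. This is a heavily simplified sketch, not how SELinux policy is written or evaluated: the rule set contains a single hypothetical entry, and the label names are borrowed from examples later in this post purely for illustration.

```python
# Hypothetical, heavily simplified rule set:
# (source label, target label, permission) tuples that are explicitly allowed.
ALLOW_RULES = {
    ("control_t", "api_socket_t", "write"),
}


def is_allowed(source, target, permission, rules=ALLOW_RULES):
    """Mandatory access control: anything not explicitly allowed is denied."""
    return (source, target, permission) in rules


if __name__ == "__main__":
    print(is_allowed("control_t", "api_socket_t", "write"))
    print(is_allowed("container_t", "api_socket_t", "write"))
```

The key design point is that there is no fallback: the absence of a matching rule is itself the denial, regardless of which user the process runs as.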

Running SELinux adds yet another layer of protection against unknown vulnerabilities in your applications or zero days.

With Bottlerocket, we run SELinux in enforcing mode by default and the kernel is compiled in such a way that it prevents SELinux from being disabled or from enforcement mode being turned off. Additionally, the built-in rules on Bottlerocket automatically restrict the resources that can be accessed by containers that run on the system. Let’s walk through how this affects container escapes on Bottlerocket.

As an attacker, once I get access to the host as the root user, it’s usually game over. As root, I can inflict much damage. For example, I can install a kernel module to circumvent protections that depend on the kernel for enforcement, run additional processes, change the configuration of the operating system, and so on.

With Bottlerocket, however, all unprivileged containers are automatically assigned the restrictive container_t label. This constrains the actions that the container and its child processes can perform against the host operating system, even when the container is run as root.

Example

Suppose that an attacker managed to mount the Bottlerocket API socket inside an unprivileged pod. As the following output shows, the socket is mounted at /run/api.sock. When the attacker tries using the apiclient to output the configuration of Bottlerocket, they will get an error.

$ df -h
Filesystem Size Used Avail Use% Mounted on
overlay 79G 5.1G 71G 7% /
tmpfs 64M 0 64M 0% /dev
tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup
tmpfs 1.6G 1.4M 1.6G 1% /run/api.sock
shm 64M 0 64M 0% /dev/shm
/dev/nvme1n1p1 79G 5.1G 71G 7% /etc/hosts
/dev/root 906M 619M 225M 74% /usr/bin/apiclient
tmpfs 3.8G 12K 3.8G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 3.8G 0 3.8G 0% /proc/acpi
tmpfs 3.8G 0 3.8G 0% /sys/firmware
$ apiclient -u /settings
Failed GET request to '/settings': Failed to send request: error trying to connect: Permission denied (os error 13)

The SELinux policy on Bottlerocket prevents containers with the container_t label from accessing a socket with the api_socket_t label. Only privileged containers or containers that run with the super_t label (also known as superpowered containers) can access the API socket.

$ ls -Z /run/api.sock
system_u:object_r:api_socket_t:s0 /run/api.sock

The following output shows that the container’s processes are labeled with the container_t label.

$ /proc/1/root/run# ps -aZ
LABEL PID TTY TIME CMD
system_u:system_r:container_t:s0:c458,c557 17 pts/0 00:00:00 sh
system_u:system_r:container_t:s0:c458,c557 18 pts/0 00:00:00 bash
system_u:system_r:container_t:s0:c458,c557 273 pts/0 00:00:00 ps
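
The labels in this output follow the general SELinux context layout of user:role:type, optionally followed by a sensitivity level and a set of categories. The short Python sketch below splits a label into those fields. The class and function names are our own, and the parser handles only the simple forms that appear in this post.

```python
from typing import NamedTuple, Optional


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    sensitivity: Optional[str]
    categories: Optional[str]


def parse_context(label):
    """Split 'user:role:type[:sensitivity[:categories]]' into its fields."""
    parts = label.split(":", 4)
    if len(parts) < 3:
        raise ValueError(f"not an SELinux context: {label!r}")
    user, role, type_ = parts[:3]
    sensitivity = parts[3] if len(parts) > 3 else None
    categories = parts[4] if len(parts) > 4 else None
    return SELinuxContext(user, role, type_, sensitivity, categories)


if __name__ == "__main__":
    print(parse_context("system_u:system_r:container_t:s0:c458,c557"))
```

The type field (container_t here) is the part that the policy rules in this post refer to.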

Reviewing the SELinux rules, we can see that only privileged containers, or containers that run with the control_t label, are allowed to access the API.

; Only the API server and specific components can use the API
; socket, as this provides a means to escalate privileges and
; persist changes.
(allow api_s api_socket_t (files (mutate)))
(allow control_s api_socket_t (files (mutate)))
; Unprivileged components are not allowed to use the API socket.
(neverallow unprivileged_s api_socket_t (files (mutate)))

Containers that run as privileged are assigned the control_t label, which allows writes to Bottlerocket’s API socket in addition to the actions allowed under the container_t domain.

Write access to this socket grants full control to Bottlerocket’s system configuration. This includes the ability to define an arbitrary source for host containers, and to run those containers with “superpowers” that can bypass other restrictions.

Ideally, you want to prevent containers scheduled through your orchestrator from mounting Bottlerocket’s API socket and running as privileged unless it is necessary. This can be enforced through OPA/Gatekeeper or Kyverno, two policy as code solutions that run as admission controllers within your Kubernetes cluster.
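
The kind of check such an admission policy performs can be sketched in Python against a pod manifest loaded as a dictionary. This is not OPA/Gatekeeper or Kyverno policy syntax; it only illustrates the logic. The function names are hypothetical, and the sketch assumes the API socket lives at /run/api.sock on the host, so a hostPath volume for that path or any parent directory is flagged.

```python
API_SOCKET_PATH = "/run/api.sock"  # assumed host path of the API socket


def exposes_api_socket(host_path):
    """True if a hostPath volume would expose the socket or a parent directory."""
    if not host_path:
        return False
    base = host_path.rstrip("/")
    return base == API_SOCKET_PATH or API_SOCKET_PATH.startswith(base + "/")


def policy_violations(pod):
    """Return a list of reasons this pod manifest (a dict) should be rejected."""
    spec = pod.get("spec", {})
    violations = []
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged"):
            violations.append(f"container {container.get('name')!r} is privileged")
    for volume in spec.get("volumes", []):
        host_path = (volume.get("hostPath") or {}).get("path", "")
        if exposes_api_socket(host_path):
            violations.append(f"volume {volume.get('name')!r} exposes {API_SOCKET_PATH}")
    return violations
```

An admission controller would reject any pod for which this list is non-empty, with exceptions granted only to the namespaces that genuinely need the access.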

When you do need to modify the configuration of the operating system, we recommend using Bottlerocket’s control container instead. The control container, unlike a container scheduled through your orchestrator, is intended to be accessed from SSM Session Manager, which includes an option to log all commands issued during the interactive session to Amazon Simple Storage Service (Amazon S3). This log can be used for auditing purposes when necessary.

On Kubernetes, you may want to consider using Pod Security Standards (PSS) to configure the appropriate pod restrictions once it emerges from alpha. PSS establishes a set of security profiles with different permissions. For example, the privileged profile is unrestricted and is aimed at system or infrastructure-type workloads, whereas the baseline profile is aimed at the majority of workloads, where privileged access is unnecessary. Under this scheme, pods run for the purpose of configuring Bottlerocket could run under privileged, while all other pods could run under the baseline or restricted profiles.

In the next example, an attacker has managed to compromise a privileged container and is going to try to overwrite (delete) an image layer. As the following output shows, the image layers are labeled with the cache_t label.

$ /mnt/local/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256# ls -Z
system_u:object_r:cache_t:s0 021c8ee1007da9a5b1241afbdd155df3f4cdd8698d074778c387019bd352a0d8
system_u:object_r:cache_t:s0 02ed246c254c5a7bdaadb5ceedf02619a655c626795921e4912647b8e9f41ae5
system_u:object_r:cache_t:s0 039713e93bc9b5976fb596a3b5eed3370eba602761b80b99022d7276f597b569

Instead of overwriting a layer, we’re going to try deleting a layer, which requires write privileges. Despite being root and running as privileged, the action is denied.

$ /mnt/local/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256# rm 021c8ee1007da9a5b1241afbdd155df3f4cdd8698d074778c387019bd352a0d8
rm: cannot remove '021c8ee1007da9a5b1241afbdd155df3f4cdd8698d074778c387019bd352a0d8': Permission denied

The SELinux log shows that the write action was blocked by Bottlerocket’s SELinux policy. The policy hinders an attacker’s ability to tamper with the image layers of other containers that are cached on the host’s persistent storage.

$ /mnt/local/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256# dmesg
[ 0.000000] Linux version 5.10.50 ([email protected]) (x86_64-bottlerocket-linux-gnu-gcc (Buildroot 2021.02.3) 10.3.0, GNU ld (GNU Binutils) 2.35.2) #1 SMP Wed Aug 4 17:13:18 UTC 2021
...
[2522874.479664] audit: type=1400 audit(1631131343.009:4): avc: denied { write } for pid=2066233 comm="rm" name="sha256" dev="nvme1n1p1" ino=3662970 scontext=system_u:system_r:control_t:s0 tcontext=system_u:object_r:cache_t:s0 tclass=dir permissive=0
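
When reviewing many denials, it can help to pull the fields out of each AVC record programmatically. The following Python sketch parses a line in the format shown here into a dictionary. The function name and regular expressions are our own and cover only this record layout.

```python
import re

AVC_RE = re.compile(
    r"avc:\s+(?P<result>denied|granted)\s+\{\s*(?P<perms>[^}]*?)\s*\}\s+for\s+(?P<rest>.*)"
)


def parse_avc(line):
    """Extract the result, permissions, and key=value fields from an AVC record."""
    match = AVC_RE.search(line)
    if match is None:
        return None
    record = {
        "result": match.group("result"),
        "perms": match.group("perms").split(),
    }
    # Fields are key=value pairs; values may be quoted.
    for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', match.group("rest")):
        record[key] = value.strip('"')
    return record
```

For the record in this example, the scontext field identifies the process that was blocked (control_t) and the tcontext field identifies the object it tried to modify (cache_t).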

The rules show that the runtime is the only component that mutates files with the cache_t label.

; Only specific components can write to these objects, as they
; provide a means to persist changes across container restarts
; and reboots.
(allow runtime_s cache_t (files (mutate)))

The super_t label allows the majority of actions on the host (those defined in the SELinux policy) and should be reserved solely for the admin container. On Bottlerocket, the super_t label is marked as permissive, which means that SELinux policies are not enforced, but actions that would have been denied in enforcing mode are still logged.

The admin container is a special host container that should be used sparingly for troubleshooting issues with the operating system or to install kernel modules (so long as kernel lockdown mode is set to none). When the admin container is enabled, an SSH key is required to log in.

Inside the container is a shell script, sheltie, which you can use to get a full root shell on the Bottlerocket host and access to the root file system. As a best practice, you should create a policy that prohibits the bulk of your containers from running as privileged or from running with the control_t or super_t labels. On Kubernetes, OPA/Gatekeeper or Kyverno can be used as your policy enforcement mechanism. A library of policies for OPA/Gatekeeper and Kyverno can be found in the Amazon EKS Best Practices Guide.

Example

Another way an attacker might try to circumvent the protections inside Bottlerocket is by disabling SELinux or turning off enforcement mode. In this example, imagine that the attacker has managed to escape from a privileged or superpowered container and tried running the following commands:

echo 1 > /sys/fs/selinux/disable
echo 0 > /sys/fs/selinux/enforce

Both of these commands will produce bash: echo: write error: Invalid argument, proving that SELinux cannot be disabled.

What’s next for SELinux?

In the near future, we will be adding support for Multi-Category Security (MCS), which will add yet another layer of protection. With MCS, each container will be assigned an MCS pair automatically. This prevents unprivileged containers from tampering with the processes, files, and so on, owned by another container running on the same host. This is an important feature because it helps to minimize the impact of a container escape where an attacker is able to gain access to the host file system. Learn more about this feature in the discussion on GitHub.
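
The general MCS rule can be approximated as a set comparison: a process may access an object only if it holds every category assigned to that object. The Python sketch below illustrates that rule using the category syntax that appears in SELinux labels. It describes MCS in general terms and does not show how Bottlerocket assigns or enforces category pairs; the function names are hypothetical.

```python
def parse_categories(spec):
    """Turn 'c458,c557' into {458, 557}; ranges such as 'c0.c3' are expanded."""
    categories = set()
    for part in filter(None, spec.split(",")):
        if "." in part:
            low, high = part.split(".", 1)
            categories.update(range(int(low[1:]), int(high[1:]) + 1))
        else:
            categories.add(int(part[1:]))
    return categories


def mcs_allows(process_categories, object_categories):
    """Pass only if the process holds every category assigned to the object."""
    return parse_categories(object_categories) <= parse_categories(process_categories)
```

Two containers that are assigned different category pairs therefore fail this check against each other’s files and processes, even though both run with the same type label.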

Ephemeral storage for configuration

With Bottlerocket, users cannot modify system configuration files such as /etc/resolv.conf or /etc/containerd/config.toml directly. On Bottlerocket, /etc is a tmpfs mount rather than a directory on a persisted file system. If an attacker were to modify the files in /etc, their changes are unlikely to persist when the system is rebooted. Using a tmpfs volume, along with dm-verity, makes it difficult for an attacker to make changes to the file system that survive a reboot.
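
You can confirm how a path is mounted by inspecting the kernel’s mount table. The Python sketch below parses text in the /proc/mounts format and reports the file system type and options for each mount point. The function names are our own, and the script must run somewhere the relevant mount table is visible.

```python
def parse_mounts(text):
    """Map each mount point to (fstype, set of options) from /proc/mounts-style text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fstype, options = fields[:4]
        mounts[mount_point] = (fstype, set(options.split(",")))
    return mounts


def is_tmpfs(mount_point, path="/proc/self/mounts"):
    """True if `mount_point` appears in the mount table with type tmpfs."""
    with open(path) as f:
        entry = parse_mounts(f.read()).get(mount_point)
    return entry is not None and entry[0] == "tmpfs"
```

A tmpfs entry means the contents live in memory, which is why changes made there do not survive a reboot.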

When you do need to change the configuration of the system, you do so through the Bottlerocket API, which runs locally on the instance. The API is implemented as a Linux socket with an HTTP interface and is the primary way to read and modify operating system settings, to update services based on those settings, and to learn about and change the state of the operating system. These settings are persisted across reboot and migrated through operating system upgrades. They are used to render system configuration files from templates on every boot.
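
Because the API is HTTP over a Unix domain socket, any HTTP client that can connect to a socket path can talk to it. The Python sketch below shows the general technique using only the standard library. It is an illustration rather than a replacement for apiclient: the class and function names are our own, the socket path /run/api.sock and the /settings path are taken from the examples in this post, and the sketch assumes the response body is JSON. As described earlier, SELinux policy still decides whether the calling process may use the socket at all.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5):
        # The host name is only used for the Host header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def api_get(socket_path, path):
    """Send GET <path> over the socket and return (status, decoded JSON body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read()
        return response.status, (json.loads(body) if body else None)
    finally:
        conn.close()


if __name__ == "__main__":
    print(api_get("/run/api.sock", "/settings"))
```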

Configuration can also be applied by running additional containers. For example, on Kubernetes, configuration for networking and storage devices can be applied using containers like the CNI and CSI plugins, which are deployed as node agents (DaemonSets) and often run with additional privileges.

Example

Let’s look at what happens when an attacker manages to mount /etc on the host to /mnt/etc inside an unprivileged container. The file system type for /mnt/etc is tmpfs:

$ /mnt/etc# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 79G 4.6G 72G 6% /
tmpfs 64M 0 64M 0% /dev
tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup
tmpfs 3.8G 432K 3.8G 1% /mnt/etc
tmpfs 3.8G 4.0K 3.8G 1% /mnt/etc/cni
/dev/nvme1n1p1 79G 4.6G 72G 6% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 3.8G 12K 3.8G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 3.8G 0 3.8G 0% /proc/acpi
tmpfs 3.8G 0 3.8G 0% /sys/firmware

Notice that the etc_t label has been applied to the mount. This prevents privileged and unprivileged containers from modifying the files on that mount.

$ /mnt/etc/cni/net.d# mount | grep /mnt/etc
tmpfs on /mnt/etc type tmpfs (rw,nosuid,nodev,noexec,noatime,context=system_u:object_r:etc_t:s0,mode=755)
tmpfs on /mnt/etc/cni type tmpfs (rw,nosuid,nodev,noexec,noatime,seclabel)

Following is a line from the SELinux audit log showing that the write action was denied from a privileged container (control_t):

[2543663.669086] audit: type=1400 audit(1631152132.402:6): avc: denied { write } for pid=2272229 comm="touch" name="/" dev="tmpfs" ino=1 scontext=system_u:system_r:control_t:s0 tcontext=system_u:object_r:etc_t:s0 tclass=dir permissive=0

Again, if you need to modify the configuration of Bottlerocket, you should do so through the API or as part of the bootstrap process by referencing the settings TOML form in Amazon EC2 user-data.

Conclusion

Guarding against threats is a major concern for security practitioners. With Bottlerocket, AWS offers a secure, purpose-built operating system for running containers that can improve security in the cloud.

Although Bottlerocket can improve your security posture, we also strongly recommend having a layered approach to security, where controls are implemented at multiple layers in the stack. At AWS, we recommend following best practices for securing containerized environments. This includes preventing containers from running as privileged (or root), mounting host paths, running with host network, or host PID, or additional SELinux labels.

You may also want to consider implementing the default seccomp policy for your container runtime, for example, Docker (already enabled in the Amazon ECS variant) or containerd.

For convenience, we’ve included a sample PSP with the previous recommendations in the GitHub repository for Bottlerocket. We have also published best practices guides for Amazon ECS and Amazon EKS, where you can learn about additional measures to implement to make environments even more secure.

We hope you’ll try using Bottlerocket for running your containerized workloads and explore its different security features.

If you have a suggestion or experience issues while using Bottlerocket, let us know by opening an issue in the GitHub repository.