By Ahmed Hany, Solutions Architect, ISV – AWS
By Teddy Schmitz, Solutions Architect, B2B Startups – AWS

As part of adopting a multi-tenant software-as-a-service (SaaS) model, a key challenge is providing strong tenant isolation in a cost-effective and scalable manner. The ability to effectively isolate your tenants is an essential part of any multi-tenant system.

A previous blog post by the AWS SaaS Factory team on dynamic policy generation introduced the mechanics of utilizing AWS Identity and Access Management (IAM) policies generated dynamically. In this post, we’ll look at how these policies get applied as part of the overall isolation story of your SaaS solution.

We will be using our reference implementation to demonstrate how to use dynamically generated policies in code. This incorporates a microservice running in AWS Lambda to retrieve tenant scoped products from Amazon DynamoDB.

We’ll also cover how to manage the policy templates, and how you can monitor the AWS Security Token Service (STS).


Figure 1 – Sample tenant request flow.

Solution Building Blocks

Before we can dive into utilizing dynamically generated policies, let’s take a look at the key components of our SaaS environment.

In Figure 1 above, you can see we have a SaaS application that, in this case, is made up of a single serverless product microservice running in Lambda that’s accessed by multiple tenants. While we are showing a single microservice here, you can imagine how your own system would be made up of a collection of microservices that would use the same isolation mechanism we’ll apply to this one microservice.

The microservice shown here is storing and retrieving tenant-scoped data from a database, in this case DynamoDB. When called, the service returns all products that are associated with a tenant.

To maximize agility and operational efficiency, we want to run our Lambda functions with an execution role that allows them to be used for all tenants.

This broader scoping means our function requires some additional, per-request scoping to ensure it can only access a single tenant’s data. This is achieved by acquiring a separate set of tenant-scoped credentials each time our microservice processes a request to access data.

Tenant Request Flow

Step 1: Tenant Sends Request

At the beginning of a flow, you can see that a tenant sends a request to the Amazon API Gateway. Each request has a token that includes the context of the tenant making the request (acquired during authentication). This token in our sample solution is a JSON Web Token (JWT) with two custom claims: tenant_id and tenant_name.
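To make the tenant context concrete, a decoded payload for such a token might look like the following sketch. The claim values shown here are purely illustrative, not taken from the reference implementation:

```json
{
  "sub": "1234567890",
  "tenant_id": "tenant-a1b2c3",
  "tenant_name": "AnyCompany",
  "iat": 1516239022,
  "exp": 1516242622
}
```

The standard claims (sub, iat, exp) come from your identity provider; the two custom claims carry the tenant context used for scoping downstream.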

Step 2: Microservice Processes Request

Once an authorized request is processed and routed by Amazon API Gateway, a call is made to your microservice (in this case a Lambda function).

For our example, let’s assume the incoming request needs to access data stored in a DynamoDB table that holds data for all tenants. Before this microservice can access the data, we need to acquire credentials that will use the supplied tenant context to scope access to the data.

This is where the Token Vending Machine (TVM) is used to acquire tenant-scoped credentials. To start this process, the microservice (Lambda) creates a new instance of the TVM which is included as a library in an AWS Lambda layer.

The example code below illustrates a call to the TVM. The first step is to construct an instance of the TVM; a call is then made to this instance, passing in the request headers that include the tenant context. The call returns a set of credentials that are scoped by the tenant.

TokenVendor tokenVendor = new TokenVendor();
final AwsCredentialsProvider awsCredentialsProvider = tokenVendor.vendTokenJwt(input.getHeaders());

You’ll notice this example encapsulates all the heavy lifting in a Lambda layer. This limits the amount of code a developer needs to write to acquire the tenant-scoped STS credentials.

Step 3: TVM Lookup of Dynamic Policies

If we look into the Lambda layer, when the TVM is instantiated it first checks to see if the correct version of the policy templates is available locally. Policy templates are cached locally in the /tmp folder as a best practice to enhance Lambda performance.

The policy version is specified by an environment variable, allowing you to easily control which templates are loaded at run time. If the templates are not found, they will be downloaded from Amazon Simple Storage Service (Amazon S3) and saved to the local filesystem.

The environment variable is managed by the same pipeline responsible for updating policy templates in S3. See the “Managing Policies” section later in the article for more details.

if (Files.notExists(templateFilePath)) {
    logger.info("Templates zip file not found, downloading from S3...");
    S3Client s3 = S3Client.builder()
            .httpClientBuilder(UrlConnectionHttpClient.builder())
            .build();
    s3.getObject(GetObjectRequest.builder().bucket(TEMPLATE_BUCKET).key(TEMPLATE_KEY).build(),
            ResponseTransformer.toFile(templateFilePath));
    try {
        ZipFile zipFile = new ZipFile(templateFilePath.toFile());
        zipFile.extractAll(templateDirPath);
        logger.info("Templates zip file successfully unzipped.");
    } catch (IOException e) {
        logger.error("Could not unzip template file.", e);
        throw new RuntimeException(e.getMessage());
    }
}
this.templateDir = new File(templateDirPath);

The code snippet above shows this in action. As policy templates will be cached locally in the /tmp folder, the code constructs the file path using the version and file name expected. If not found, it then attempts to download and extract the file from S3.

You will only need to download the policies from S3 once on a cold start, or when policy versions are changed. Otherwise, it will be cached locally for quick reuse on the next request.

This approach allows you to work on the templates outside the lifecycle of the microservice, separating the deployment of policy updates from any updates you might make to the code of your Lambda.

Step 4: TVM Calls STS

The TVM layer hydrates the templates with tenant context, assembles them, and makes a call to STS, injecting the dynamically generated policy as an inline policy when assuming the predefined role. Let’s see how that works in action.

public AwsCredentialsProvider vendTokenJwt(Map<String, String> headers) {
    Map<String, String> policyData = new HashMap<>();
    policyData.put("table", DB_TABLE);
    FilePolicyGenerator policyGenerator = new FilePolicyGenerator(templateDir, policyData);
    JwtTokenVendor jwtTokenVendor = JwtTokenVendor.builder()
            .policyGenerator(policyGenerator)
            .durationSeconds(900)
            .headers(headers)
            .role(ROLE)
            .region(AWS_REGION)
            .build();
    AwsCredentialsProvider awsCredentialsProvider = jwtTokenVendor.vendToken();
    tenant = jwtTokenVendor.getTenant();
    logger.info("Vending JWT security token for tenant {}", tenant);
    return awsCredentialsProvider;
}

For your reference, here is the link to TokenVendor.java in the GitHub repo.

First, the TVM gets any extra data it needs to substitute tenant-specific values into the policy templates. In this case, that is the DynamoDB table name, which is the same as the tenant_name extracted from the JWT token.

It then grabs the policies sitting on the local file system (from Step 3) and creates a token vendor with all the necessary details to vend STS tokens. The TVM will validate the JWT, inject the template variables, and submit the dynamically generated policy to STS.
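To make this concrete, a policy template for the product table might look like the following sketch. The {{table}} and {{tenant}} placeholders are illustrative (the template syntax in the reference implementation may differ); the dynamodb:LeadingKeys condition restricts access to items whose partition key matches the tenant identifier:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:*:*:table/{{table}}",
      "Condition": {
        "ForAllValues:StringLike": {
          "dynamodb:LeadingKeys": ["{{tenant}}"]
        }
      }
    }
  ]
}
```

Once hydrated with a specific tenant’s context, this policy is passed inline to AssumeRole, so the resulting credentials can never read another tenant’s items.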

It then returns a credentials provider you can pass to a service client from the AWS SDK. After you have the credentials, your microservice is free to make calls to DynamoDB to acquire data (Step 5).

Step 5: Access Data with Limited Scope

Back in the Lambda function, we can use the scoped AWS credentials to call DynamoDB to retrieve all products owned by a single tenant.

TenantProduct tenantProduct = new TenantProduct(awsCredentialsProvider, tenant);
tenantProduct = tenantProduct.load(tenantProduct);

You can see we are creating a DynamoDB client using our new tenant-scoped credentials provided by the TVM layer. Now, any operation that is performed on our DynamoDB table will be constrained to the tenant making the request. Any request for data associated with another tenant, for example, would not return any results.

Managing Dynamic IAM Policies

Another integral part of the solution is how policy templates are going to be managed in terms of versioning, deployment, and caching.

Deployment and caching will allow the TVM to effectively use those templates in a production-ready environment. We are leveraging AWS CodeCommit and AWS CodePipeline for the continuous delivery of all policy templates to S3.

The basic lifecycle of the policy template can be summarized as follows:

  1. All policy templates are grouped together in the master branch of the CodeCommit repository. Each template defines the tenant isolation policy for a tenant resource. After a commit is approved for release, a new Git tag is created for that commit with the version number.

  2. CodePipeline is configured to trigger once a new Git tag with the prefix “release” is created, deploying the latest version to S3. In this step, a new object is created with the commit hash in the folder name, in the format templates/{GIT_COMMIT_HASH}/policies.zip.

  3. CodePipeline also updates the TVM environment variable with the new commit hash. The TVM uses this value to access the right version of the policy templates. This way, you can roll back to any version, with immediate effect on the TVM’s usage of policies, simply by updating this environment variable.

Next, let’s see how TVM determines the correct template to load in real-time and cache templates for fast response:

  1. Whenever the TVM receives a request from the microservice for a temporary tenant token, it uses an environment variable to get the commit hash and construct the relevant file path of the policy needed inside its /tmp folder (where templates are cached).

  2. The TVM attempts to load the file from its local cache in the /tmp folder using the constructed file path.

  3. If the file exists in the cache, it is used to generate the token. Otherwise, the file is loaded from S3 and cached.

This approach ensures the TVM is always using the correct policy version from S3 without compromising on performance.

Monitoring Isolation Activity

Even after you have introduced dynamic policies into your environment, you may still want to monitor and analyze how these isolation policies are applied. You’ll want to make sure you are utilizing policies at every layer of your SaaS architecture to ensure no cross-tenant access occurs.

This sample solution includes a few mechanisms that can help with this.

First, by utilizing a shared library in the form of a Lambda layer (or a similar mechanism on other platforms), you can lower the risk of developers calling STS directly without a policy document. A standardized way of requesting credentials gives developers one consistent path to tenant-scoped credentials.

Additionally, having a robust code review mechanism for the permission templates and microservices helps mitigate this risk. Making sure templates are set up correctly to receive a tenant identifier and do not allow broader permissions than expected, as well as making sure all microservices use the TVM before being deployed, is the best defense against accidentally granting broad permissions.

While these strategies can help, you may also see value in introducing constructs that can bring more visibility to scenarios where tenant resources could be at risk for cross-tenant access.

This is where the STS watchdog comes into play. Whenever an operation in STS is performed, a corresponding event in AWS CloudTrail is generated. Combined with Amazon EventBridge, we can use this to watch for when a role is assumed.
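As a sketch, an EventBridge rule that matches these CloudTrail events could use an event pattern like the following (shown here as an illustration; the rule definition in the reference implementation may differ):

```json
{
  "source": ["aws.sts"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["sts.amazonaws.com"],
    "eventName": ["AssumeRole"]
  }
}
```

Any AssumeRole call recorded by CloudTrail in the account would then be routed to the watchdog Lambda for inspection.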

In Figure 1 above, we have a Lambda function listening for these events, which we can then inspect to see if they are restricting permissions as expected. Below is the code the STS Watchdog uses to inspect events:

String eventName = event.getDetail().getEventName();
if (eventName.equals("AssumeRole")) {
    String policy = event.getDetail().getRequestParameters().getPolicy();
    String roleArn = event.getDetail().getRequestParameters().getRoleArn();
    logger.log("RoleArn: " + roleArn);
    logger.log("Policy: " + policy);
    if (policy == null) {
        // Publish a message to an Amazon SNS topic.
        final String msg = "A call to AssumeRole was made without an inline policy.";
        PublishRequest publishRequest = PublishRequest.builder()
                .message(msg)
                .topicArn(topicArn)
                .build();
        snsClient.publish(publishRequest);
    }
}

The watchdog is waiting for AssumeRole events. When it receives one, it checks the event details to make sure an inline policy was used. If there is no inline policy, or the policy does not contain an expected string, it publishes a message to Amazon Simple Notification Service (SNS). This can then be forwarded to an ops team to take further action, or trigger downstream Lambda functions to take remediation steps.

Upon receiving an alert, you could take advantage of the recently launched additions to Amazon Detective to inspect the role and its sessions. You can use this to decide on the best remediation steps to take.

You can find the watchdog on GitHub, and we welcome any pull requests to enhance the functionality of the Lambda.

Implementation Guidance

As part of introducing the TVM and dynamic policies to your SaaS architecture, there are some implementation strategies you may want to factor into your design.

These practices may enhance the performance and usability of your solution:

  • Reduce request latency by caching TVM-generated tokens for each tenant and tenant resource. Cached tokens can be used until they expire, which is controlled by the DurationSeconds parameter of the AssumeRole call in the TVM.

  • Use Git tag version numbers instead of the commit hash for a more readable value indicating the policy version currently in use. This makes rollbacks easier: simply change this value to a lower version directly.

  • Currently, the solution loads the whole policies folder from S3. This can increase loading time when you have many templates in the folder. This could be refined to load only the policy templates needed at a given moment in time.
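To illustrate the first suggestion, here is a minimal sketch of a per-tenant credential cache. TenantTokenCache is a hypothetical helper, not part of the reference implementation; the TTL should mirror the DurationSeconds value passed to AssumeRole (900 seconds in our TVM), and a production version would refresh credentials shortly before expiry rather than exactly at it:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical per-tenant cache for vended credentials; entries expire
// after a fixed TTL matching the AssumeRole DurationSeconds value.
class TenantTokenCache<T> {

    private static final class Entry<T> {
        final T value;
        final Instant expiresAt;

        Entry(T value, Instant expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry<T>> cache = new ConcurrentHashMap<>();
    private final Duration ttl;

    TenantTokenCache(Duration ttl) {
        this.ttl = ttl;
    }

    // Return the cached credentials for this tenant, or invoke the vendor
    // (e.g. the TVM) and cache a fresh set if none exist or they expired.
    T getOrVend(String tenantId, Supplier<T> vendor) {
        Entry<T> entry = cache.get(tenantId);
        if (entry == null || Instant.now().isAfter(entry.expiresAt)) {
            entry = new Entry<>(vendor.get(), Instant.now().plus(ttl));
            cache.put(tenantId, entry);
        }
        return entry.value;
    }
}
```

In the product microservice, the vendor supplier would wrap the call to tokenVendor.vendTokenJwt, so repeated requests from the same tenant within the TTL skip the round trip to STS.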

Conclusion

In this post, we showed you an end-to-end architecture that scales the use of AWS IAM to govern any number of tenants’ access to AWS resources. SaaS providers looking to scale can use this approach to keep leveraging AWS access management and logical isolation as an extra security layer for tenant requests.

You can try out the reference architecture for yourself. If you’d like to understand in more detail how a Token Vending Machine (TVM) works, please read Scaling SaaS Tenant Isolation with Dynamic Policy Generation.