Organizations are using Amazon Simple Storage Service (S3) for building their data lakes, websites, mobile applications, and enterprise applications. As the number of objects within your S3 bucket increases, you may want to move older objects into lower-cost tiers of Amazon S3. In some cases you may want to delete the objects altogether to further reduce S3 storage costs. A common practice is to use S3 Lifecycle rules to achieve this. These rules can be applied to objects based on their creation date. In certain situations, you may want to keep objects available that are still being accessed, but transition or delete objects that are no longer in use.

In this post, we will demonstrate how you can create custom object expiry rules for Amazon S3 based on the last accessed date of the object. We will first walk through the various features used within the workflow, followed by an architecture diagram outlining the process flow.

Amazon S3 server access logging

S3 Server access logging provides detailed records of the requests that are made to objects in Amazon S3 buckets. Amazon S3 periodically collects access log records, consolidates the records in log files, and then uploads log files to your target bucket as log objects. Each log record consists of information such as bucket name, the operation in the request, and the time at which the request was received. S3 Server Access Log Format provides more details about the format of the log file.

Amazon S3 inventory

Amazon S3 inventory provides a list of your objects and the corresponding metadata on a daily or weekly basis, for an S3 bucket or a shared prefix. The inventory lists are stored as a comma-separated value (CSV) file compressed with GZIP, as an Apache optimized row columnar (ORC) file compressed with ZLIB, or as an Apache Parquet file compressed with Snappy.

Amazon S3 Lifecycle

Amazon S3 Lifecycle policies help you manage your objects through two types of actions, Transition and Expiration. In the architecture shown following in Figure 1, we create an S3 Lifecycle configuration rule that expires objects after ‘x’ days. It has a filter for an object tag of “delete=True”. You can configure the value of ‘x’ based on your requirements.

If you are using an S3 bucket to store short lived objects with unknown access patterns, you might want to keep the objects that are still being accessed, but delete the rest. This will let you retain objects in your S3 bucket even after their expiry date as per the S3 lifecycle rules, while saving you costs by deleting objects that are not needed anymore. The following diagram shows an architecture that considers the last accessed date of the object before deleting S3 objects.

Figure 1. Object expiry architecture flow

Figure 1. Object expiry architecture flow

This architecture uses native S3 features mentioned earlier in combination with other AWS services to achieve the desired outcome.

Here is the architecture flow:

  1. The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
  2. An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
  3. An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
  4. The Lambda function initiates an S3 Batch Operation job to tag objects in the source bucket. These must be expired using the following logic:
    • Capture the number of days (x) configuration from the S3 Lifecycle configuration.
    • Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than ‘x’ days, but not accessed during that time.
    • Write a manifest file with the list of these objects to an S3 bucket.
    • Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of “delete=True”.
  5. The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to ‘x’ days. They will have the tag given via the S3 batch operation of “delete=True”.

The preceding architecture is built for fault tolerance. If a particular run fails, all the objects that must be expired will be picked up during the next run. You can configure error handling and automatic retries in your Lambda function. An Amazon Simple Notification Service (SNS) topic will send out a notification in the event of a failure.

Cost considerations

S3 server access logs, S3 inventory lists, and manifest files can accumulate many objects over time. We recommend you configure an S3 Lifecycle policy on the target bucket to periodically delete older objects. Although following the guidelines in this post can decrease some of your costs, S3 requests, S3 inventory, S3 Object Tagging, and Lifecycle transitions also have costs associated with them. Additional details can be found on the S3 pricing page.

Amazon Athena charges you based on the amount of data scanned by each query. But Amazon S3 inventory can also output files in Apache ORC or Apache Parquet format, which can reduce the amount data scanned by Athena. The Athena pricing page would be helpful to review.

AWS Lambda has a free usage tier of 1M free requests per month and 400,000 GB-seconds of compute time per month. However, you are charged based on the number of requests, the amount of memory allocated, and the runtime duration of the function. See more at the Lambda pricing page.

Conclusion

In this blog post, we showed how you can create a custom process to delete objects from your S3 bucket based on the last time the object was accessed. You can use this architecture to customize your object transitions, clean up your S3 buckets for any unnecessary objects, and keep your S3 buckets cost-effective. This architecture can also be used on versioned S3 buckets with some minor modifications.

We hope you found this blog post useful and welcome your feedback!

Read more about queries, rules, and tags:

Categories: Architecture