Authored by Jon Stanley, Head of Systems at Method Studios in Melbourne. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
If you’ve seen a blockbuster film in the last five years, you’ve likely witnessed the visual effects and animation work done by the exceptionally talented artists at Method Studios in Melbourne. Our work spans film genres, from “Aquaman” and “Terminator: Dark Fate” to “Jumanji: The Next Level” and “John Wick: Chapter 3 – Parabellum,” and episodic series, including the Emmy Award-winning “Battle of the Bastards” episode in the sixth season of HBO’s “Game of Thrones.” Reflecting on our work, Disney’s “Christopher Robin” proved one of the more memorable undertakings, given the sheer feat of bringing the film’s honey-loving CG co-star, Winnie-the-Pooh, to life. Taking on this project not only resulted in an Asian Academy Creative Award win for the team, but it also marked the beginning of our collaboration with AWS.
Winning the bid for “Christopher Robin” in 2017 was thrilling, yet intimidating, as rendering a full-CG bear, often featured in close-up shots, is challenging. We’d been using Thinkbox Deadline software to manage our render farm for many years, since before Thinkbox Software was acquired by AWS. As we began prepping for “Christopher Robin,” the AWS Thinkbox team helped us set up a proof-of-concept workflow for burst rendering to the cloud. Before starting, we had to think through how we’d present our data to the cloud-based render nodes, where to store the data, and how to integrate cloud-based compute with our existing on-premises farm.
We considered a number of options in how to present textures and geometry to the Amazon Elastic Compute Cloud (Amazon EC2) instances. The simplest approach was to present our on-premises Network File System (NFS) server directly over our AWS Direct Connect to the Amazon EC2 Spot Fleet in the Sydney Region. That said, we wanted to take advantage of hundreds of instances that would create tens of gigabits of traffic. Getting that kind of throughput over a virtual private network (VPN) is challenging and cost prohibitive in terms of egress traffic. Instead, we opted for Amazon Elastic File System (Amazon EFS), a cloud-based, managed NFS service. Setup with Amazon EFS was simple, and the default configuration could be scaled back to minimize costs when unused. Though a less automated approach, we ended up using the provisioned throughput setting for more predictable performance.
In the years since “Christopher Robin,” the technical requirements for rendering have only grown more demanding, so we’ve continued to refine our workflow and how we use AWS services. Today, projects require frames to be rendered at 4.5K resolution, which is where access to cloud-based compute with lots of RAM and CPU cores pays off. Frames that might struggle to render on on-premises hardware can take advantage of those large Amazon EC2 instances, with all compute managed using the same Deadline instance. As Amazon EC2 workers come online, they become available in Deadline Monitor alongside our on-premises farm.
One of the less obvious advantages we found while bursting into AWS is a reduced impact on on-premises storage. Input/output (I/O) requirements from heavy render activity can impact artists significantly, since it’s hard to stay productive when your storage can’t keep up. Moving that heavy I/O workload to Amazon EFS has protected performance for our artists and, as a result, boosted productivity and morale around the studio. Alongside Amazon EFS, we’ve added an all-flash cluster from AWS Partner Qumulo to speed up application startup times.
To ensure that the right data reaches the cloud for renders, and returns to artist workstations as soon as a frame finishes rendering, we built a system to programmatically generate a list of dependent assets and software required to render any given shot. Our toolset could already track dependencies, so it only took two weeks to build the basic infrastructure and software. Today, we use a database to track the files held in cloud storage. Before transferring data, we check against that database to see if an asset needs to be synced. Anything not already in the database is added to a queue, which feeds a farm of sync workers that move data to and from cloud storage. Deadline captures any errors resulting from missing assets, and each such error automatically triggers a request to retrieve the missing files. The failed task then waits and resumes once the files have arrived in cloud storage.
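The check-then-queue step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production code: the "database" is an in-memory set, the queue is a `deque`, and the names `queue_missing_assets`, `run_sync_worker`, and `copy_to_cloud` are hypothetical.

```python
from collections import deque


def queue_missing_assets(dependencies, synced, queue):
    """Queue every dependency not yet recorded as synced.

    `synced` stands in for the tracking database (assumption: a set of paths).
    Returns the list of paths that were newly queued, in order.
    """
    pending = set(queue)  # avoid queueing the same asset twice
    queued = []
    for path in dependencies:
        if path not in synced and path not in pending:
            queue.append(path)
            pending.add(path)
            queued.append(path)
    return queued


def run_sync_worker(queue, synced, copy_to_cloud):
    """Drain the queue, copying each asset and recording it as synced.

    `copy_to_cloud` is a hypothetical callable that performs the transfer.
    """
    while queue:
        path = queue.popleft()
        copy_to_cloud(path)
        synced.add(path)  # record only after the copy succeeded
```

Recording an asset as synced only after the copy returns means a failed transfer leaves the asset eligible for the next pass, which matches the retry behaviour described above.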
Since we’re using the cloud to augment our on-premises render capacity, we needed to be able to scale out quickly without manual intervention, and just as easily scale down so we’re not paying for resources one second longer than we need them. To achieve this, we built tools that watch Deadline for queued tasks. Now when our on-premises render capacity is exhausted, we can easily scale up our Spot Fleet and begin rendering in the cloud within minutes. As soon as there are no more queued tasks, Deadline shuts down idle machines, so we only pay for compute that is in use. We track and monitor spending using dashboards and cost allocation tags. This way we always have visibility into our spend, and a pre-set alert warns us if we’re exceeding the forecasted budget.
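As a rough illustration of the scaling decision described above, the following Python sketch computes how many cloud instances to request from the number of queued tasks. The function name, the parameters, and the default limits are all assumptions made for this example; they are not Deadline or AWS settings.

```python
def target_cloud_capacity(queued_tasks, idle_on_prem_workers,
                          tasks_per_instance=1, max_instances=500):
    """Return how many cloud instances to request.

    Only tasks that idle on-premises workers cannot absorb overflow to the
    cloud. The result is capped at `max_instances` (a hypothetical budget
    guard) and drops to zero when the queue is empty.
    """
    overflow = max(0, queued_tasks - idle_on_prem_workers)
    needed = -(-overflow // tasks_per_instance)  # ceiling division
    return min(needed, max_instances)
```

A watcher could call this on a schedule and apply the result as the Spot Fleet's target capacity. Returning zero when nothing is queued is what keeps idle compute from accruing cost.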
Occasionally our asset sync system might fail to copy something needed by a render, so we created a virtual workstation using an Amazon EC2 G4dn (GPU) instance that is set up exactly like an on-premises workstation. Anytime we need to troubleshoot a broken render, we connect to the cloud-based workstation and fire up a scene as if we’re running it in our studio to quickly identify any problems. When a render completes on AWS, the finished frame or data is synced back on-premises so that the artist can review it. To ensure renders are turning out as expected, we created a web application that allows artists to preview frames while they’re rendering. We take advantage of all the metrics coming out of Amazon CloudWatch and use Grafana dashboards to track Amazon EC2 metrics, storage, bandwidth, costs, and anything else we can imagine.
Now more than three years into our cloud journey, we’ve seen how integral rendering on AWS is to our render farm strategy. We’ve shifted focus away from capex purchases or hiring equipment, and are now focused on the cloud. At this point, we’ve automated most AWS infrastructure management and are spending far less time racking servers, managing firmware updates and replacing broken hardware. Our goal is to keep reducing the time it takes to get a finished frame back to artists, so that we can keep raising the bar in terms of quality and turnaround time for clients.
About the author: Jon Stanley is Head of Systems at Method Studios in Melbourne, a role he’s held since 2017. With nearly 20 years of experience in systems engineering for VFX and post-production studios, he has also spent time at Iloura, Lipsync Post, and Framestore.