AWS provides powerful services to analyze and transform videos, images, sounds, speech, and text; however, to use those services you have to build on general-purpose AWS services, like AWS Lambda, AWS Step Functions, Amazon S3, and more. My colleagues and I believe multimedia application developers are better served by an API purpose-built for multimedia, so we built one. This blog post explores that framework, the Media Insights Engine, which aims to help developers focus less on the scaffolding and more on the things that make their applications impactful.
My colleagues at AWS Elemental deal with all aspects of video, having built applications for video streaming, media asset management, video content analysis, and more. Through our combined experiences, my team has learned which design patterns are common in multimedia applications, such as:
- How to define video pipelines from a series of serverless functions.
- How to catalog multimedia data like binary audio, time series video captions, and plaintext metadata together in a unified storage system.
- How to build analytical web applications that can handle the data intensity of full-length movies.
Rather than implement these patterns for every new application, we envisioned starting from a foundational framework that included the building blocks needed for analyzing or transforming media files. In early 2019, we began developing such a framework, called the Media Insights Engine (MIE), and invited collaborators to contribute to the project on GitHub. In the past six months, MIE has been used as a foundation for more than 30 new applications, including one impactful application that generates multilingual subtitles for an organization broadcasting critical information to millions of global citizens.
MIE: a modular framework for video processing in AWS
The Media Insights Engine (MIE) is a highly configurable, stage-based, serverless media processing framework. MIE enables developers to create workflows composed of operators that perform tasks on media in a modular, extensible manner. These operators can perform a wide range of tasks, from simple video transcoding to multifaceted computer vision analysis, all within a single workflow defined in Step Functions. MIE handles workflow orchestration, multimodal data storage, and infrastructure scaling in an efficient manner optimized to reduce cost. This allows builders to focus more time on creating business value and less time on solving common operational challenges.
MIE follows the design principle of Unix, which emphasizes simple modular parts being connected by clean interfaces. In MIE, the simple modular parts are operators that analyze or transform multimedia objects, and the interfaces are APIs that orchestrate workflow pipelines and data persistence. This design succinctly encompasses the following three elements commonly needed for building multimedia applications:
- Operators: modular functions that perform discrete tasks to transform or analyze a media object
- Pipelines: a sequence of operators that work together to derive new media objects or new metadata about a media object
- Persistence: a storage system for storing or retrieving multimodal data like binary media objects and plain text metadata
When developers can access pipelines and data persistence with purpose-built APIs, they can focus on implementing applications without worrying about low-level APIs for orchestrating pipelines and accessing multimodal data. This separation of concerns leads to accelerated development time for multimedia applications and higher quality for pipeline orchestration and data persistence layers.
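To make the three elements concrete, here is a minimal Python sketch of how operators, a pipeline, and a persistence layer can fit together. It is not the MIE API: the names `Persistence`, `Pipeline`, `transcribe`, and `word_count` are hypothetical, the store is an in-memory dictionary, and the operators are stand-ins that return placeholder data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; these are not MIE's real interfaces.
Metadata = Dict[str, object]
Operator = Callable[[str, Metadata], Metadata]


@dataclass
class Persistence:
    """Stores metadata per asset and per operator (in memory here)."""
    _store: Dict[str, Dict[str, Metadata]] = field(default_factory=dict)

    def put(self, asset_id: str, operator_name: str, result: Metadata) -> None:
        self._store.setdefault(asset_id, {})[operator_name] = result

    def get(self, asset_id: str) -> Dict[str, Metadata]:
        return self._store.get(asset_id, {})


@dataclass
class Pipeline:
    """Runs operators in order; each one sees metadata from earlier operators."""
    operators: List[Operator]
    persistence: Persistence

    def run(self, asset_id: str) -> Metadata:
        context: Metadata = {}
        for op in self.operators:
            result = op(asset_id, context)
            self.persistence.put(asset_id, op.__name__, result)
            context.update(result)
        return context


def transcribe(asset_id: str, context: Metadata) -> Metadata:
    # Stand-in for a speech-to-text operator; returns placeholder text.
    return {"transcript": f"placeholder transcript for {asset_id}"}


def word_count(asset_id: str, context: Metadata) -> Metadata:
    # Derives new metadata from an earlier operator's output.
    return {"word_count": len(str(context["transcript"]).split())}


store = Persistence()
pipeline = Pipeline(operators=[transcribe, word_count], persistence=store)
summary = pipeline.run("video-001")
```

The point of the sketch is the separation of concerns: operator authors only write functions like `transcribe`, while ordering and storage live in one place that every application can reuse.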
I have seen the benefits of MIE play out in several applications. Most of them are video-on-demand applications designed to analyze videos uploaded by users. The following are a few examples of applications that have been built using MIE.
Transforming video content with redaction
One of my colleagues created a face redaction application with the following three operators:
- an Amazon Rekognition operator that identified faces in a video
- an OpenCV operator that extracted and blurred an image for each video frame identified by Amazon Rekognition
- a stitching operator that replaced video frames in the original video with the images blurred by OpenCV
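The three operators above can be sketched as plain Python functions chained together. This is only an illustration of the data flow, not my colleague's implementation: frames are small grids of grayscale values, `detect_faces` is a hypothetical stand-in that returns a hard-coded bounding box instead of calling Amazon Rekognition, and the "blur" simply averages the pixels in each box rather than using OpenCV.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]           # grayscale pixels, rows of 0-255 values
Box = Tuple[int, int, int, int]   # top, left, bottom, right (bottom/right exclusive)


def detect_faces(frames: List[Frame]) -> Dict[int, List[Box]]:
    # Stand-in for the face detection operator: pretend one face was
    # found in frame 1. A real operator would call a detection service.
    return {1: [(0, 0, 2, 2)]}


def blur_regions(frame: Frame, boxes: List[Box]) -> Frame:
    # Crude "blur": replace each boxed region with its average value.
    blurred = [row[:] for row in frame]
    for top, left, bottom, right in boxes:
        pixels = [frame[r][c] for r in range(top, bottom) for c in range(left, right)]
        mean = sum(pixels) // len(pixels)
        for r in range(top, bottom):
            for c in range(left, right):
                blurred[r][c] = mean
    return blurred


def stitch(frames: List[Frame], replacements: Dict[int, Frame]) -> List[Frame]:
    # Swap the redacted frames into the original sequence.
    return [replacements.get(i, frame) for i, frame in enumerate(frames)]


def redact(frames: List[Frame]) -> List[Frame]:
    detections = detect_faces(frames)
    blurred = {i: blur_regions(frames[i], boxes) for i, boxes in detections.items()}
    return stitch(frames, blurred)


video = [
    [[10, 20, 30], [40, 50, 60]],
    [[0, 100, 7], [100, 200, 9]],
]
redacted = redact(video)
```

Because each step only depends on the previous step's output, adding another kind of redaction means adding another detector that feeds the same blur and stitch steps.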
After a fourth operator was added to redact weapons, nudity, and violence identified by Amazon Rekognition content moderation, the processing pipeline looked like this:
Deriving video features for ad placement
A common type of application that developers build using MIE combines data types to learn what is happening in a video. For example, another colleague of mine created an ad placement application that detected scene changes by combining pixel data from an OpenCV operator with metadata from a pair of Amazon Rekognition operators designed to count objects and people per frame. This worked well because scene changes frequently coincide with large changes in metadata, such as:
- the quantity of in-frame objects,
- the quantity of in-frame people,
- and the value of frame brightness as scenes fade to black.
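One way to combine these signals is to flag a frame when several of them change sharply at once. The sketch below assumes per-frame features have already been produced by upstream operators; the `FrameFeatures` type, the 0.5 threshold, and the "at least two signals" rule are illustrative assumptions, not the logic used in the actual application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameFeatures:
    objects: int       # objects detected in the frame
    people: int        # people detected in the frame
    brightness: float  # mean pixel brightness, 0.0 (black) to 1.0 (white)


def relative_change(previous: float, current: float) -> float:
    # Change relative to the larger value; 0.0 when both values are zero.
    largest = max(abs(previous), abs(current))
    return 0.0 if largest == 0 else abs(current - previous) / largest


def scene_changes(frames: List[FrameFeatures], threshold: float = 0.5,
                  min_signals: int = 2) -> List[int]:
    """Return indices of frames where enough features change sharply."""
    changes = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        signals = sum(
            relative_change(a, b) >= threshold
            for a, b in (
                (prev.objects, cur.objects),
                (prev.people, cur.people),
                (prev.brightness, cur.brightness),
            )
        )
        if signals >= min_signals:
            changes.append(i)
    return changes


frames = [
    FrameFeatures(objects=8, people=3, brightness=0.62),
    FrameFeatures(objects=7, people=3, brightness=0.60),
    FrameFeatures(objects=1, people=0, brightness=0.05),  # fade to black
    FrameFeatures(objects=1, people=0, brightness=0.04),
    FrameFeatures(objects=6, people=2, brightness=0.55),  # new scene begins
]
cuts = scene_changes(frames)  # [2, 4]
```

Requiring agreement between signals keeps a single noisy detector, such as one missed object, from producing a false scene change.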
Once all the scene changes in a video were identified, another operator generated an ad placement file (VMAP) that could be imported into online advertising systems. The following graphic visualizes the pipeline for this application:
I like how the ad placement application demonstrates the flexibility of operators in MIE. They can be used to derive new metadata, like scene changes, or new media objects, like VMAP files. The operator abstraction in MIE also allows you to use traditional video processing utilities, like FFmpeg or OpenCV, as well as contemporary machine learning services like Amazon Rekognition.
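For readers unfamiliar with VMAP, an operator that turns scene-change timestamps into an ad placement file could look roughly like the following. This is a sketch under assumptions rather than the operator from the application: it emits only a minimal subset of VMAP 1.0 elements, and the `breakId` values, `AdSource` ids, and ad tag URL are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

VMAP_NS = "http://www.iab.net/videosuite/vmap"


def to_time_offset(seconds: float) -> str:
    # VMAP time offsets use the HH:MM:SS.mmm form.
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_vmap(break_seconds: List[float], ad_tag_url: str) -> str:
    """Build a minimal VMAP document with one linear ad break per timestamp."""
    ET.register_namespace("vmap", VMAP_NS)
    root = ET.Element(f"{{{VMAP_NS}}}VMAP", {"version": "1.0"})
    for n, seconds in enumerate(break_seconds, start=1):
        ad_break = ET.SubElement(root, f"{{{VMAP_NS}}}AdBreak", {
            "timeOffset": to_time_offset(seconds),
            "breakType": "linear",
            "breakId": f"scene-{n}",      # placeholder identifier
        })
        source = ET.SubElement(ad_break, f"{{{VMAP_NS}}}AdSource", {"id": f"ad-{n}"})
        tag = ET.SubElement(source, f"{{{VMAP_NS}}}AdTagURI", {"templateType": "vast3"})
        tag.text = ad_tag_url             # placeholder ad server URL
    return ET.tostring(root, encoding="unicode")


# Scene changes at 12m34.5s and 1h into the video (example values).
document = build_vmap([754.5, 3600], "https://example.com/vast.xml")
```

An advertising system would typically need more than this, so treat the output as a starting point to validate against the VMAP specification and your ad server's requirements.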
Evaluating off-the-shelf AI with your own content
The market for pre-trained AI services is growing rapidly. For example, Amazon currently offers no fewer than 12 top-level AI services that can be used without prior machine learning experience. The first thing customers want to know is whether these services are accurate enough for their own content:
- Does Amazon Rekognition provide labels for the objects I’m looking for?
- Can Amazon Transcribe accurately parse the terminology used in my videos?
- Can Amazon Translate accurately translate my videos to the regional languages of my audience?
My team built the AWS Content Analysis solution using MIE so developers can test-drive AWS AI services with their own content. This application, which is included as a reference application within the open source MIE framework, uses several AWS services to catalog videos with computer vision and speech detection data.
Content analysis for automated translations
One of the most impactful use cases for MIE involved adapting the AWS Content Analysis application for a global organization that needed to accelerate the production of public service announcements in multiple languages for a global audience. By taking advantage of the modularity of operators in MIE, my team quickly delivered an application for automated translation by extending the translate and transcribe operators in the content analysis application. This in turn helped stakeholders learn which languages perform well and which ones still require manual translation. Without MIE, it would have taken them much longer to understand how automated translation on AWS performs with their own content and over a wide variety of languages.
MIE is a software development toolkit for applications that analyze or transform videos on AWS. MIE can shorten development time and reduce the learning curve for building cost-effective video processing applications. By using MIE, developers can focus on the things that generate business value, like algorithms and data visualization.
If you plan on building applications that analyze or transform videos on AWS, consider Media Insights Engine. It may save you a lot of time and help your application scale.