At AWS re:Invent 2020, we announced the preview of Amazon EMR Studio, an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug applications written in R, Python, Scala, and PySpark. Today, we’re excited to announce the general availability of EMR Studio and new features we’ve added since the preview, including the ability to use the Amazon EMR console and AWS CloudFormation to create and configure a new EMR Studio for your team, support for Microsoft Active Directory (AD) as an identity provider, a new quick start notebook experience, the ability to launch the live Apache Spark UI directly from an EMR Studio notebook, and support for private Git repositories.

EMR Studio provides fully managed Jupyter notebooks, and tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Studio uses AWS Single Sign-On and allows you to log in directly with your corporate credentials without signing in to the AWS Management Console. You can install custom kernels and libraries, collaborate with peers using code repositories such as GitHub and Bitbucket, and run parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow and Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

With EMR Studio, you can run notebook code on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), and take advantage of the performance-optimized EMR runtime for Apache Spark. You can set up EMR Studio to run applications on existing EMR clusters or create new clusters using AWS CloudFormation templates for Amazon EMR.

Several customers participated in the EMR Studio preview, including Mapbox, which provides a mapping and location cloud platform for developers.

“Mapbox provides precise location data and developer tools to change the way we navigate the world,” says Saba El-Hilo, Head of Data Platform, Mapbox. “EMR Studio allows us to prototype Spark applications and data science models that power large-scale data processing and transformations. The integrated development environment makes it easy for data scientists and engineers to perform ad hoc analysis and debug data processing workloads.”

New EMR Studio features

We’ve added new features based on feedback from preview customers to simplify both configuration and application development with EMR Studio.

Now you can use the EMR console, AWS CloudFormation, or the AWS Command Line Interface (AWS CLI) to create a new EMR Studio for your team. You can use the guided steps on the Amazon EMR console to easily set up security features and access control, and assign users or groups to an EMR Studio. You can also view Studio configurations and delete Studios in the UI. You can automate Studio creation in AWS CloudFormation by specifying the configurations and dependencies in a CloudFormation template. In addition, we’ve added support for Microsoft AD as an identity source that you can use with EMR Studio via AWS SSO.
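As a rough illustration of the programmatic route, the following sketch assembles the settings a Studio needs and passes them to the CreateStudio API through the AWS SDK for Python (boto3), which is used here in place of the AWS CLI. Every ID, role ARN, and bucket name is a made-up placeholder, and the helper names build_studio_request and create_studio are hypothetical. Treat this as a starting point under those assumptions rather than a reference implementation.

```python
from typing import Any, Dict, List


def build_studio_request(
    name: str,
    vpc_id: str,
    subnet_ids: List[str],
    service_role: str,
    user_role: str,
    workspace_sg: str,
    engine_sg: str,
    default_s3_location: str,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the EMR CreateStudio API call."""
    if not default_s3_location.startswith("s3://"):
        raise ValueError("default_s3_location must be an s3:// URI")
    if not subnet_ids:
        raise ValueError("at least one subnet ID is required")
    return {
        "Name": name,
        "AuthMode": "SSO",  # Studio users sign in through AWS SSO
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        "ServiceRole": service_role,
        "UserRole": user_role,
        "WorkspaceSecurityGroupId": workspace_sg,
        "EngineSecurityGroupId": engine_sg,
        "DefaultS3Location": default_s3_location,
    }


def create_studio(request: Dict[str, Any]) -> str:
    """Send the request with boto3 and return the new Studio's access URL."""
    import boto3  # imported lazily so the builder works without the SDK

    emr = boto3.client("emr")
    response = emr.create_studio(**request)
    return response["Url"]


# Placeholder values only (assumptions); none of these resources exist.
request = build_studio_request(
    name="team-studio",
    vpc_id="vpc-0123456789abcdef0",
    subnet_ids=["subnet-0123456789abcdef0"],
    service_role="arn:aws:iam::111122223333:role/ExampleStudioServiceRole",
    user_role="arn:aws:iam::111122223333:role/ExampleStudioUserRole",
    workspace_sg="sg-0aaaaaaaaaaaaaaa0",
    engine_sg="sg-0bbbbbbbbbbbbbbb0",
    default_s3_location="s3://example-bucket/studio/",
)
```

Calling create_studio(request) with valid credentials and real resource IDs would submit the request. Keeping the request builder separate from the API call makes it possible to review or unit test the configuration before anything is created.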

We also gave administrators more flexibility when creating cluster templates. Now you can specify parameters that users can set when they create clusters using your template. As in the preview, you can also provide multiple cluster variations with a fixed set of parameters if you prefer.
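To make the idea of a parameterized cluster template concrete, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and serializes it to JSON. The parameter names, default values, release label, application list, and role names are illustrative assumptions, not values taken from this announcement, and the ClusterId output reflects an assumption about what a Studio cluster template should expose. A real template would typically need more settings, such as networking and logging.

```python
import json
from typing import Any, Dict


def build_cluster_template() -> Dict[str, Any]:
    """Return a CloudFormation template with user-settable cluster parameters."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example EMR cluster template with user-settable parameters",
        # Values a user could choose when creating a cluster from the template.
        "Parameters": {
            "ClusterName": {"Type": "String", "Default": "studio-cluster"},
            "CoreInstanceCount": {
                "Type": "Number",
                "Default": 2,
                "MinValue": 1,
                "MaxValue": 10,
            },
            "InstanceType": {
                "Type": "String",
                "Default": "m5.xlarge",
                "AllowedValues": ["m5.xlarge", "m5.2xlarge"],
            },
        },
        "Resources": {
            "EmrCluster": {
                "Type": "AWS::EMR::Cluster",
                "Properties": {
                    "Name": {"Ref": "ClusterName"},
                    "ReleaseLabel": "emr-6.2.0",  # illustrative release (assumption)
                    "Applications": [
                        {"Name": "Spark"},
                        {"Name": "Livy"},
                        {"Name": "JupyterEnterpriseGateway"},
                    ],
                    "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed role names
                    "ServiceRole": "EMR_DefaultRole",
                    "Instances": {
                        "MasterInstanceGroup": {
                            "InstanceCount": 1,
                            "InstanceType": {"Ref": "InstanceType"},
                        },
                        "CoreInstanceGroup": {
                            "InstanceCount": {"Ref": "CoreInstanceCount"},
                            "InstanceType": {"Ref": "InstanceType"},
                        },
                    },
                },
            }
        },
        # Ref on an AWS::EMR::Cluster resource resolves to the cluster ID.
        "Outputs": {"ClusterId": {"Value": {"Ref": "EmrCluster"}}},
    }


template = build_cluster_template()
template_json = json.dumps(template, indent=2)
```

A fixed-parameter variation, as supported since the preview, would simply replace each Ref with a literal value and drop the Parameters section.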

We’ve added new sample notebooks that make it easier to start building data science applications in EMR Studio. For a quick start, you can use samples such as PySpark code that queries a Hive metastore and Python code for visualization. You can create copies of the notebooks in your EMR Studio workspace, run them as is, or edit them to meet your needs. For more information and a list of EMR Studio sample notebooks, see Configure a Workspace for EMR Studio.

We’ve extended the collaboration features of EMR Studio to include connecting from notebooks in EMR Studio to GitHub, Bitbucket, GitLab, and AWS CodeCommit repositories on private networks such as on-premises and customer VPCs. During the preview, you could only connect to repositories on public networks.

Finally, we’ve made application debugging easier by enabling you to launch the live Apache Spark UI directly from notebooks within EMR Studio. During the preview, you had to leave the notebook in EMR Studio, locate the application of interest on the cluster, and launch the Spark History Server. Now you can access logs and debug your application without leaving the notebook interface in EMR Studio.

Get started with EMR Studio

If you already use Amazon EMR, check out the tutorial Getting Started with the Amazon EMR Studio Interface.

If you’re running Apache Spark and other big data applications on premises or self-hosting them, learn about migrating to Amazon EMR in the Self-Service EMR Migration Guide and create a migration plan for your organization in a free workshop with Amazon EMR specialists.


About the Author

Shuang Li is a Senior Product Manager for Amazon EMR at AWS. She holds a doctoral degree in Computer Science and Engineering from Ohio State University.