Multiple ways of invoking Amazon EMR

Photo by Luca Bravo on Unsplash

Amazon EMR (Elastic MapReduce) has become the ‘go-to’ tool for data scientists and enthusiasts alike who want computing power with the convenience of ‘pay as you go’.

The advent of ‘Serverless Architecture’ and the need to spin up an on-demand ‘Spark’ compute engine have made EMR one of the most used services on AWS. With its ability to handle a wide range of use cases, Amazon EMR brings reliability, security and cost effectiveness.

This post, however, is not about ‘What is EMR?’; there are plenty of documents and blogs on that topic. It is about three options for creating, accessing and terminating an EMR cluster.

Option 1 : Lambda + boto3 = Simplicity

The Python boto3 library provides a convenient way to create transient EMR clusters and to add ‘steps’ to existing clusters.

The example below uses ‘add_job_flow_steps’ to add a HadoopJarStep to an existing cluster. add_job_flow_steps has the following syntax:

response = client.add_job_flow_steps(
    JobFlowId='string',
    Steps=[
        {
            'Name': 'string',
            'ActionOnFailure': 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'|'CANCEL_AND_WAIT'|'CONTINUE',
            'HadoopJarStep': {
                'Properties': [
                    {
                        'Key': 'string',
                        'Value': 'string'
                    },
                ],
                'Jar': 'string',
                'MainClass': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ]
)
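
To make the syntax concrete, here is a minimal sketch of a Lambda function that adds one Spark step to a running cluster. It assumes the cluster has Spark installed and that the step runs spark-submit through command-runner.jar. The helper names (build_spark_step, add_step) and the event keys (cluster_id, script_uri) are hypothetical and not part of boto3.

```python
def build_spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Return one entry for the Steps list of add_job_flow_steps."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def add_step(emr_client, cluster_id, step):
    """Add a single step to an existing cluster and return its step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


def lambda_handler(event, context):
    # Imported here so the helpers above can be used without boto3 installed.
    import boto3

    emr = boto3.client("emr")
    # "cluster_id" and "script_uri" are assumed event keys for this sketch.
    step = build_spark_step("example-spark-step", event["script_uri"])
    return {"step_id": add_step(emr, event["cluster_id"], step)}
```

The handler returns as soon as the step is queued; it does not wait for the Spark job to finish.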

Multiple steps (256 max) can be added to a job flow.
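
This option also covers creating transient clusters, which boto3 does through run_job_flow. The sketch below builds such a request under stated assumptions: the release label, instance types and instance count are placeholders, the role names are the EMR defaults, and the helper names are hypothetical.

```python
def transient_cluster_request(name, log_uri, steps):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",  # assumed release; choose one that suits the job
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # placeholder instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False makes the cluster transient: it terminates when the steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": steps,
        # Default EMR role names; replace them if your account uses custom roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_transient_cluster(emr_client, request):
    """Start the cluster and return its ID (the job flow ID)."""
    return emr_client.run_job_flow(**request)["JobFlowId"]
```

Passing the steps at creation time means the cluster starts, runs them and shuts itself down without any further call.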

Pros

This is the simplest option, requiring minimal knowledge of how EMR functions. A user can set it up and get going in a matter of minutes, and it can be easily embedded in any existing data flow pipeline.

Cons

Tracking the progress and eventual status of the job(s) submitted on EMR is a hassle. Added to that, Lambda functions have a 15-minute run limit.
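
To illustrate the tracking burden, here is a rough sketch of a single status check using the boto3 describe_step call. The helper names are hypothetical. Each check is one API call, so something outside the Lambda function still has to decide when to check again.

```python
# Step states after which EMR will not change the status any further.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def step_state(emr_client, cluster_id, step_id):
    """Return the current state of one step, e.g. PENDING or RUNNING."""
    response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
    return response["Step"]["Status"]["State"]


def is_finished(state):
    """True once the step has reached a terminal state."""
    return state in TERMINAL_STATES
```

Looping on these helpers inside one Lambda invocation would run into the run limit for long jobs, which is the gap the next option addresses.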

Option 2 : Livy + Lambda + SFN = Transparency

This option uses a combination of Livy, Lambda and SFN (AWS Step Functions) for submitting and tracking jobs on Amazon EMR.

Apache Livy is a service that enables easy interaction with an EMR cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, and synchronous or asynchronous result retrieval. Further details on Livy are a topic for a future post.
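
As a rough illustration of that REST interface, the sketch below submits a batch job and reads back its state using only the Python standard library. It assumes Livy is reachable at a URL such as http://<master-node>:8998 (8998 is Livy's default port). The function names are hypothetical, and this is not the author's Lambda code.

```python
import json
import urllib.request


def livy_batch_payload(file_uri, args=None, name=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": file_uri}
    if args:
        payload["args"] = list(args)
    if name:
        payload["name"] = name
    return payload


def submit_batch(livy_url, payload):
    """Submit a batch job and return the integer batch ID assigned by Livy."""
    request = urllib.request.Request(
        f"{livy_url}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]


def batch_state(livy_url, batch_id):
    """Return the state Livy reports for a batch, e.g. 'running' or 'success'."""
    with urllib.request.urlopen(f"{livy_url}/batches/{batch_id}/state") as response:
        return json.load(response)["state"]
```

Submission and status lookup are separate, short HTTP calls, which is what makes it possible to split them across two Lambda functions.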

Process Flow

An AWS Step Functions (SFN) state machine is used to orchestrate the submission of jobs to EMR via Apache Livy. The approach uses two Lambda functions, one for submission and one for status tracking.

State Machine

The Submit Job Lambda step invokes a Lambda function (described in detail in a later post) to submit a job to EMR via Livy.

The Track Status Lambda step invokes another Lambda function to track the submitted job and return its current status. This process continues until a ‘success’ or ‘failure’ is reached.
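
The submit-then-poll loop described above could be expressed in Amazon States Language roughly as follows, written here as a Python dictionary that serialises to the JSON definition. This is a sketch, not the author's actual state machine: the state names, the wait interval and the assumption that the tracking Lambda returns a plain ‘success’ or ‘failure’ string are all illustrative.

```python
import json


def livy_state_machine(submit_lambda_arn, track_lambda_arn, wait_seconds=30):
    """Return a state machine definition that submits a job and polls its status."""
    return {
        "Comment": "Submit a job via Livy, then poll until it finishes",
        "StartAt": "Submit Job",
        "States": {
            "Submit Job": {
                "Type": "Task",
                "Resource": submit_lambda_arn,
                "ResultPath": "$.job",  # keep the submit Lambda's output for tracking
                "Next": "Wait",
            },
            "Wait": {"Type": "Wait", "Seconds": wait_seconds, "Next": "Track Status"},
            "Track Status": {
                "Type": "Task",
                "Resource": track_lambda_arn,
                "ResultPath": "$.status",  # assumed to be a plain status string
                "Next": "Job Finished?",
            },
            "Job Finished?": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.status", "StringEquals": "success", "Next": "Succeeded"},
                    {"Variable": "$.status", "StringEquals": "failure", "Next": "Failed"},
                ],
                "Default": "Wait",  # any other status means the job is still running
            },
            "Succeeded": {"Type": "Succeed"},
            "Failed": {"Type": "Fail", "Error": "JobFailed"},
        },
    }


def to_definition_json(definition):
    """Serialise the dictionary to the JSON text that Step Functions expects."""
    return json.dumps(definition, indent=2)
```

The Wait state carries the waiting, so neither Lambda function has to stay alive while the Spark job runs.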

Pros

This method provides the flexibility of the Livy REST API to submit and track jobs. A near real-time status of the jobs can be tracked and logged. The 15-minute run restriction does not impact this approach, as the first Lambda submits the job and exits while the tracking Lambda is invoked repeatedly over the lifetime of a job.

Cons

More code needs to be built, and knowledge of SFN is required. Livy tracks jobs through integer job IDs, which could lead to mix-ups if Livy shuts down and restarts without recovery being set, as the job IDs reset to 0 again.
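
One way to guard against the ID reset is to enable Livy's session recovery when the cluster is created. The sketch below shows what such a setting might look like as an EMR configuration entry, assuming the livy-conf classification and a filesystem state store. The state-store path is a placeholder and the helper name is hypothetical, so check the values against your Livy version.

```python
# Configuration entry for the Configurations list passed to run_job_flow.
LIVY_RECOVERY_CONFIGURATION = {
    "Classification": "livy-conf",
    "Properties": {
        # "recovery" tells Livy to restore sessions and batches after a restart
        "livy.server.recovery.mode": "recovery",
        "livy.server.recovery.state-store": "filesystem",
        # placeholder location; it must be writable by the Livy server
        "livy.server.recovery.state-store.url": "hdfs:///livy-recovery",
    },
}


def with_livy_recovery(cluster_request):
    """Return a copy of a run_job_flow request with Livy recovery enabled."""
    updated = dict(cluster_request)
    existing = list(cluster_request.get("Configurations", []))
    updated["Configurations"] = existing + [LIVY_RECOVERY_CONFIGURATION]
    return updated
```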

Option 3 : Pure SFN = Versatility

Launched in late 2019, this option is the newest of the three, integrating AWS Step Functions with Amazon EMR. It allows creation and termination of EMR clusters. Steps can also be added and executed in parallel, all with the versatility of AWS Step Functions.

SFN with EMR

With this approach the only thing the user needs to know is how to create a State Machine using Step Functions. I will follow up this post with a more detailed one on using this option, especially on how it seamlessly integrates with Serverless Data Pipelines and brings to the table the versatility of SFN based State Machines. The AWS documentation is a good place to start.
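
As a taste of what such a State Machine might look like, here is a minimal sketch that creates a cluster, runs one Spark step and terminates the cluster, using the Step Functions service integrations for EMR. It is written as a Python dictionary that serialises to the JSON definition. The state names and the function name are hypothetical, and cluster_request is assumed to be a RunJobFlow-style dictionary with KeepJobFlowAliveWhenNoSteps set to true so that the cluster waits for the step.

```python
import json


def emr_state_machine(cluster_request, script_uri):
    """Return a definition: create a cluster, run one Spark step, terminate."""
    return {
        "StartAt": "Create Cluster",
        "States": {
            "Create Cluster": {
                "Type": "Task",
                # .sync makes Step Functions wait until the cluster is ready
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": cluster_request,
                "ResultPath": "$.cluster",
                "Next": "Run Spark Step",
            },
            "Run Spark Step": {
                "Type": "Task",
                # .sync makes Step Functions wait until the step completes
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "spark-step",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                        },
                    },
                },
                "ResultPath": "$.step",  # keeps $.cluster available for termination
                "Next": "Terminate Cluster",
            },
            "Terminate Cluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
                "End": True,
            },
        },
    }


def to_definition_json(definition):
    """Serialise the dictionary to the JSON text that Step Functions expects."""
    return json.dumps(definition, indent=2)
```

No Lambda function or Livy endpoint appears in the definition; Step Functions calls EMR directly and waits on each task.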

Pros

It is purely SFN based, meaning no additional services need to be configured to access EMR.

Cons

Creating a State Machine may prove tricky at times, especially when the number of steps grows.

Conclusion

With multiple options available, the question often is “What to use and when?” The answer lies in the use case. While implementing a Serverless Data Pipeline on AWS, I have often found option 2 to be the most sought after, as it provides the ability to track and trace a job. Option 1 is often used in test scenarios where teams need a sandbox to quickly test Spark jobs on EMR. Option 3 is new and is gaining popularity, although it may be tricky at first to create a State Machine.

A Cloud Architect with IBM, I am an avid reader and programming enthusiast.

Shantanu Acharya
