Multiple ways of invoking Amazon EMR

Amazon EMR (Elastic MapReduce) has become the ‘go-to’ tool for Data Scientists and enthusiasts alike who want to get hold of computing power with the convenience of ‘pay as you go’.
The advent of ‘Serverless Architecture’ and the need to spin up an on-demand ‘Spark’ compute engine have made EMR one of the most used services on AWS. With the ability to handle a wide range of use cases, Amazon EMR brings reliability, security and cost effectiveness.
This post, however, is not about ‘What is EMR?’; there are plenty of documents and blogs on that topic that one can lay hands on. This post is about three options that can be used to create, access and terminate an EMR cluster.
Option 1 : Lambda + boto3 = Simplicity
The Python boto3 library provides a convenient way to create transient EMR clusters and to add ‘steps’ to existing clusters.
The example below uses ‘add_job_flow_steps’ to add a HadoopJarStep to an existing cluster. add_job_flow_steps has the following syntax:
response = client.add_job_flow_steps(
    JobFlowId='string',
    Steps=[
        {
            'Name': 'string',
            'ActionOnFailure': 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'|'CANCEL_AND_WAIT'|'CONTINUE',
            'HadoopJarStep': {
                'Properties': [
                    {
                        'Key': 'string',
                        'Value': 'string'
                    },
                ],
                'Jar': 'string',
                'MainClass': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ]
)
Multiple steps (256 max) can be added to a job flow.
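As a concrete sketch of that syntax, the snippet below builds a step that runs spark-submit through command-runner.jar. The cluster ID, step name and S3 script path are placeholders invented for illustration, not values from this post.

```python
def build_spark_step(name, script_path, action_on_failure='CONTINUE'):
    """Build one EMR step that runs spark-submit via command-runner.jar."""
    return {
        'Name': name,
        'ActionOnFailure': action_on_failure,
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster', script_path],
        },
    }

# Hypothetical job name and script location:
step = build_spark_step('nightly-etl', 's3://my-bucket/jobs/etl.py')

# Inside a Lambda handler this payload would be submitted to a running cluster:
# import boto3
# emr = boto3.client('emr')
# response = emr.add_job_flow_steps(JobFlowId='j-XXXXXXXXXXXX', Steps=[step])
```

Keeping the step construction in a small helper like this makes it easy to submit several steps in one add_job_flow_steps call.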
Pros
This is the simplest option, requiring minimal knowledge of how EMR functions. A user can set it up and get it going in a matter of minutes, and it can be easily embedded in any existing data-flow pipeline.
Cons
Tracking the progress and eventual status of the job(s) submitted to EMR is a hassle; added to that is the fact that Lambda functions have a 15-minute run limit.
Option 2: Livy+Lambda+SFN = Transparency
This option uses a combination of Livy, Lambda & SFN to submit and track jobs on Amazon EMR.
Apache Livy is a service that enables easy interaction with an EMR cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, with synchronous or asynchronous result retrieval. Further details on Livy are a topic for a future post.

An AWS Step Functions (SFN) based State Machine is used to orchestrate job submission to EMR via Apache Livy. The approach uses two Lambda Functions (one each for submission and status tracking).

The Submit Job step invokes a Lambda function (described in detail in a later post) to submit a job to EMR via Livy.
The Track Status step invokes another Lambda function to track the submitted job and returns its current status. This process continues till a ‘success’ or ‘failure’ state is reached.
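The two Lambdas above can be sketched as follows, using Livy’s batch API (POST /batches to submit, GET /batches/{id} to poll). This is a minimal illustration assuming a hypothetical Livy endpoint on the EMR master node; it is not the exact implementation described in this post.

```python
import json
import urllib.request

LIVY_URL = 'http://emr-master:8998'  # hypothetical Livy endpoint (placeholder host)

def submit_batch(file_path, args=None):
    """Submit Job Lambda: POST a Spark job to Livy and return the batch id."""
    payload = json.dumps({'file': file_path, 'args': args or []}).encode()
    req = urllib.request.Request(
        f'{LIVY_URL}/batches', data=payload,
        headers={'Content-Type': 'application/json'}, method='POST')
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['id']

def is_terminal(state):
    """True for Livy batch states that should end the SFN polling loop."""
    return state in ('success', 'dead', 'killed')

# The Track Status Lambda would call GET {LIVY_URL}/batches/{batch_id},
# return the 'state' field, and the state machine would re-invoke it
# until is_terminal(state) is True.
```

The first Lambda returns immediately after submission; only the short-lived tracking Lambda is re-invoked, which is what sidesteps the 15-minute limit.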
Pros
This method provides the flexibility of the Livy REST API to submit and track jobs. A near real-time status of the jobs can be tracked and logged. The 15-minute run restriction does not impact this approach, as the first Lambda submits the job and exits, while the tracking Lambda is invoked repeatedly over the lifetime of a job.
Cons
More code needs to be built, and knowledge of SFN is required. Livy tracks jobs through integer job IDs; this could lead to mix-ups if Livy shuts down and restarts without recovery being set, as the job IDs reset to 0 again.
Option 3 : Pure SFN = Versatility
Launched in late 2019, this option is the newest of the three, integrating AWS Step Functions with Amazon EMR. It allows creation and termination of EMR clusters, and further steps can be added and executed in parallel, all with the versatility of AWS Step Functions.

With this approach, the only thing the user needs to know about is creating a State Machine using Step Functions. I will follow up this post with a more detailed one on using this option, especially on how it seamlessly integrates with Serverless Data Pipelines and brings to the table the versatility of SFN-based State Machines. The AWS documentation can be a good place to start.
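As a hedged sketch of what such a State Machine looks like, the snippet below assembles a minimal Amazon States Language definition that adds one Spark step to an existing cluster using the built-in SFN/EMR integration. The cluster ID and S3 path are placeholders; only the ‘arn:aws:states:::elasticmapreduce:addStep.sync’ resource pattern is the real service integration.

```python
import json

# Minimal ASL definition: add one step to an existing cluster and wait for it.
# '.sync' makes the Task state pause until the EMR step completes.
definition = {
    'StartAt': 'AddSparkStep',
    'States': {
        'AddSparkStep': {
            'Type': 'Task',
            'Resource': 'arn:aws:states:::elasticmapreduce:addStep.sync',
            'Parameters': {
                'ClusterId': 'j-XXXXXXXXXXXX',  # placeholder cluster id
                'Step': {
                    'Name': 'spark-job',
                    'ActionOnFailure': 'CONTINUE',
                    'HadoopJarStep': {
                        'Jar': 'command-runner.jar',
                        'Args': ['spark-submit', 's3://my-bucket/jobs/etl.py'],
                    },
                },
            },
            'End': True,
        },
    },
}

# Serialized definition, ready to pass to Step Functions when creating
# the state machine (e.g. via boto3's stepfunctions client).
asl_json = json.dumps(definition, indent=2)
```

Building the definition as a plain dict keeps it easy to extend with Parallel or Choice states as the pipeline grows.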
Pros
Being purely SFN based means no additional services need to be configured to access EMR.
Cons
Creating a State Machine may prove tricky at times, especially when the number of steps grows.
Conclusion
With multiple options available, the question often is “What to use, and when?” The answer lies in the use case. While implementing a Serverless Data Pipeline on AWS, I have often found option 2 to be the most sought after, as it provides the ability to track and trace a job. Option 1 is often used in test scenarios where teams need a sandbox to quickly test Spark jobs on EMR. Option 3 is new and gaining popularity, although it may be tricky at first to create a State Machine.