Productionizing Apache Spark (Data Pipelines)

This is the second post in the Running Spark On Production series; you can read the first post here.

In the first post, we talked briefly about Spark, discussed the data exploration use case, and compared the different available tools. In this post, we will discuss the second use case: building data pipelines.

2- Production Spark Pipelines Use Case:

According to the Spark 1.6.0 documentation, the only provided way to submit Spark jobs is through the spark-submit script.

You have to bundle your application with its dependencies and pass the resulting jar to spark-submit to run it.
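
If you prefer to drive that submission programmatically rather than from the shell, Spark 1.4+ also ships a thin wrapper around spark-submit, org.apache.spark.launcher.SparkLauncher. A minimal sketch; the Spark home, jar path, main class, and master below are placeholders for your own setup:

```scala
import org.apache.spark.launcher.SparkLauncher

object SubmitViaLauncher {
  def main(args: Array[String]): Unit = {
    // SparkLauncher shells out to spark-submit under the hood; all values
    // below are placeholders, not a prescription.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                      // where spark-submit lives
      .setAppResource("/path/to/my-app-assembly.jar")  // the bundled (fat) jar
      .setMainClass("com.example.MyPipeline")          // hypothetical main class
      .setMaster("yarn-client")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()                                        // returns a java.lang.Process

    process.waitFor()                                  // block until submission exits
  }
}
```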

Depending on spark-submit alone can be challenging for different reasons. Generally, having a REST API for job submission should:

1- Give us a single shared endpoint for any Spark user, without the need to set up Spark on every single client machine.

2- Allocate the Spark driver on the machine hosting the REST API server, not the one you are submitting from. (This applies to client mode, where the driver process is created on the machine submitting the job.)

3- Allow retaining a shared context between different jobs, which is very useful for low-latency jobs.

4- Allow interacting with Spark from different languages.

5- Make job tracking and monitoring much easier.

I am just trying to show the possibilities that the REST API option would enable; the availability of specific features ultimately depends on the API implementation.

Currently, there are three available options when it comes to using Spark through an API/job server:

1- Ooyala's Spark Job Server

Spark Job Server describes itself as “a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts”.

Some notes about Spark Job Server (SJS):

  • Spark Job Server is not part of the official Apache Spark repo and there are no plans to include it.

  • Deploying SJS can be somewhat annoying, as there are no explicit instructions for the steps you need to follow, especially with Cloudera's Spark distribution, due to some dependency conflicts. This also makes upgrading Spark more difficult and time consuming.

  • There is no high-availability setup available for SJS yet, so you may need to make some effort to guarantee its availability yourself.

  • Based on our experience using one shared context in SJS:

      • We faced issues under the high load of hundreds of concurrent jobs, so our team developed a queuing layer between our engine and SJS to handle this. You may need to consider a similar queuing layer if you plan to use a shared context.

      • We sometimes had to reset the context just to make sure resources were freed, especially when a problem happened in one of the running jobs.
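
For reference, jobs have to implement the interface SJS defines before they can be submitted (more on this in the comparison table below). A minimal sketch of such a job, loosely modeled on the word-count example from the SJS documentation of that era; the input.string parameter name is purely illustrative:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

// The server first calls validate() to sanity-check the request, then runJob()
// against the SparkContext it manages (which may be shared between jobs).
object WordCountJob extends SparkJob {

  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("No 'input.string' config parameter found")

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Whatever runJob returns is serialized into the job's result payload.
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
  }
}
```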

2- Spark Internal REST API

The Spark internal REST API was developed with the intent of providing a standard endpoint for Spark job submission in standalone cluster mode, and it was later extended to support Mesos cluster mode too. It has been available since Spark 1.3+, and the REST API URL is shown in the Spark master web interface.

It supports submitting and killing jobs as well as getting the status of running jobs, although, unlike SJS, it does not return the error details of exceptions occurring inside the driver when a job fails.
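
For illustration, here is a rough sketch of a submission call against this endpoint, based on the write-up referenced at the end of this post; the master host, jar path, and main class are placeholders:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

object HiddenRestSubmit {
  def main(args: Array[String]): Unit = {
    // CreateSubmissionRequest body; all paths, hosts, and the class are placeholders.
    val payload =
      """{
        |  "action": "CreateSubmissionRequest",
        |  "appResource": "file:/path/to/my-app-assembly.jar",
        |  "mainClass": "com.example.MyPipeline",
        |  "appArgs": ["arg1"],
        |  "clientSparkVersion": "1.6.0",
        |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
        |  "sparkProperties": {
        |    "spark.app.name": "MyPipeline",
        |    "spark.master": "spark://spark-master:6066",
        |    "spark.jars": "file:/path/to/my-app-assembly.jar",
        |    "spark.submit.deployMode": "cluster",
        |    "spark.driver.supervise": "false"
        |  }
        |}""".stripMargin

    // POST the request to the master's REST submission port (6066 by default).
    val conn = new URL("http://spark-master:6066/v1/submissions/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(payload.getBytes(StandardCharsets.UTF_8))
    out.close()

    // The response contains a submissionId usable with the /status and /kill endpoints.
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}
```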

There is a Java client developed by the community for easier usage, in addition to a Jenkins plugin based on it.

The main aim of the REST API was to serve as an internal endpoint for job submission between spark-submit and the Spark master, to facilitate maintaining backward and forward compatibility. However, there is an open issue (at the time of writing) requesting that the API be made public and added to the official documentation.

I benchmarked the REST API by submitting 1,000 concurrent simple Spark jobs; it was very stable and completed all of them successfully.

3- Cloudera’s Livy

Cloudera has developed a job server too, after finding that the available options didn't satisfy the needs of Cloudera's Hue. Livy is built with interactive execution in mind (like the Spark notebook mentioned earlier). Basically, it supports two modes of execution:

1- Interactive Execution

2- Batch Execution

Interactive Execution | Batch Execution
--- | ---
Based on Spark's different shells | Offers a wrapper around spark-submit
Supports Scala, Python, and R | Supports Java, Scala (packaged in jars), and Python

Livy is written in Scala. It was part of the Hue project but is now a separate project.

Livy supports two deployment modes:

  • Local
  • YARN
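
To give a feel for both execution modes, here is a rough sketch against Livy's REST endpoints (default port 8998); the host, jar path, and class name are placeholders, and the request shapes follow the Livy documentation of that era:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

object LivySubmitSketch {
  // Small helper: POST a JSON body and return the response as a string.
  def postJson(url: String, body: String): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(body.getBytes(StandardCharsets.UTF_8))
    out.close()
    Source.fromInputStream(conn.getInputStream).mkString
  }

  def main(args: Array[String]): Unit = {
    // Batch mode: Livy wraps spark-submit around a packaged jar (placeholder values).
    println(postJson("http://livy-host:8998/batches",
      """{ "file": "hdfs:///jars/my-app-assembly.jar", "className": "com.example.MyPipeline" }"""))

    // Interactive mode: create a Scala session, then run statements against it.
    println(postJson("http://livy-host:8998/sessions", """{ "kind": "spark" }"""))
    // Once the session reports "idle", statements can be posted, e.g.:
    // postJson("http://livy-host:8998/sessions/0/statements",
    //   """{ "code": "sc.parallelize(1 to 10).sum()" }""")
  }
}
```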

Comparison Between SJS, Spark Internal REST API and Cloudera’s Livy

Spark Job Server | Internal REST API | Cloudera Livy
--- | --- | ---
A lot of configuration/adjustments needed before cluster deployment, especially with modified Spark releases like Cloudera's. | Works out of the box (forward and backward compatibility starting from Spark 1.3+). | Some configuration is needed, especially if you want to run it with modes other than YARN.
Provides the error message of failed jobs. | Just reports the failure status. | Just reports the failure status.
Supports sharing one context between different jobs. | Only one context per job. | Supports sharing one context between different jobs.
Spark job code has to implement a certain interface defined by SJS to be suitable for submission. | No modifications required. | No modifications required.
Standalone, YARN, and Mesos modes (client mode) are supported, although configuration is required. | Standalone and Mesos (cluster mode) are supported out of the box. | YARN and local modes.
No HA arrangements available. | Depends on the Spark master's availability. | No HA arrangements available.
Sharing RDDs among different jobs in the same context, as well as named RDD support. | Not available, although Alluxio (formerly Tachyon) can be used for similar results. | Sharing RDDs among different users and contexts.
Scala and Java jobs are supported. | Scala and Java jobs are supported. | Scala, Java, and Python are supported.

Spark Jobs as a part of a workflow

Running your jobs as part of a bigger workflow is important. Although there are a lot of options for designing your data workflows, Apache Oozie is one of the most widely used frameworks in this area.

Let's see how to run Spark as part of an Oozie workflow.

Oozie Spark Node

Oozie can handle Spark job management through its Oozie Spark node. The node basically uses the spark-submit script behind the scenes to initiate Spark jobs, either on YARN or locally. Spark jobs are treated like any other Hadoop MapReduce job inside Oozie.

In case you are using any of the previously mentioned APIs and don't want to use the node provided by Oozie, you can create a custom Oozie node to fit your needs. While using Spark Job Server, we developed our own Oozie Spark node and it worked perfectly (we may consider open sourcing it if there is interest from the community).

References:

http://arturmkrtchyan.com/apache-Spark-hidden-rest-api

http://gethue.com/spark/

