Run Spark Jobs with Oozie
Prerequisites
IMPORTANT This component is deprecated. Hewlett Packard
Enterprise recommends using an alternate product. For more information, see Discontinued Ecosystem Components.
Complete the following steps to configure Oozie to run Spark jobs:
Configure a Spark action:
Procedure
- To run a Spark action through Oozie, you must be able to connect to Hive on a
  secure cluster. Make sure the hive-site.xml file used by Oozie has the
  following property set:

      <property>
        <name>hive.metastore.sasl.enabled</name>
        <value>true</value>
      </property>
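As a quick sanity check of this property, the sketch below writes a minimal hive-site.xml to a temp file and greps for the setting. The temp file and XML layout are illustrative; on a real cluster, point the check at the hive-site.xml your Oozie installation actually uses.

```shell
# Illustrative check only: build a minimal hive-site.xml in a temp file,
# then verify that hive.metastore.sasl.enabled is set to true.
HIVE_SITE=$(mktemp)
cat > "$HIVE_SITE" <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>
</configuration>
EOF

# A name/value pair spans two lines, so grab the line after the name.
grep -A1 'hive.metastore.sasl.enabled' "$HIVE_SITE" | grep -q '<value>true</value>' \
  && echo "metastore SASL enabled"
```

Against a real cluster, replace the temp file with the path to the hive-site.xml that Oozie uses.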
- To add Spark configuration files (spark-defaults.conf, hive-site.xml, and so on)
  to a Spark action, copy the files to the {OOZIE_HOME}/share/lib/spark/ directory.
- If needed, update the Oozie shared libraries as described in Updating the Oozie
  Shared Libraries.
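The copy step above can be sketched end to end. The temp directories below stand in for the real {SPARK_HOME} and {OOZIE_HOME}, which vary by installation.

```shell
# Sandboxed sketch of the copy step; substitute your real SPARK_HOME and
# OOZIE_HOME paths on an actual cluster.
SPARK_HOME=$(mktemp -d)
OOZIE_HOME=$(mktemp -d)
mkdir -p "$SPARK_HOME/conf" "$OOZIE_HOME/share/lib/spark"
touch "$SPARK_HOME/conf/spark-defaults.conf" "$SPARK_HOME/conf/hive-site.xml"

# The actual step: copy Spark configuration files into the Oozie sharelib.
cp "$SPARK_HOME/conf/spark-defaults.conf" "$SPARK_HOME/conf/hive-site.xml" \
   "$OOZIE_HOME/share/lib/spark/"
ls "$OOZIE_HOME/share/lib/spark/"
```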
- Run pySpark using a Spark action by copying the pyspark and py4j zip files to the
  sharelib:

      cp /{SPARK_HOME}/python/lib/pyspark*.zip {OOZIE_HOME}/share/lib/spark/
      cp /{SPARK_HOME}/python/lib/py4j*src.zip {OOZIE_HOME}/share/lib/spark/

- Update the Oozie shared libraries as described in Updating the Oozie Shared
  Libraries.
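Once the zip files are in the sharelib, a Spark action can run a Python script by naming it in the jar element (for Python jobs, no class element is used). This fragment is a sketch; the script name and path are illustrative:

```
<spark xmlns="uri:oozie:spark-action:0.1">
    ...
    <master>yarn</master>
    <mode>cluster</mode>
    <name>Spark-Pi-Python</name>
    <jar>${nameNode}/user/${wf:user()}/apps/spark/lib/pi.py</jar>
</spark>
```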
- When you configure a Spark action in the workflow.xml, specify the master and
  mode elements of the Spark job:
  - For Spark standalone mode, specify the Spark Master URL in the master element.
    For example, if your Spark Master URL is spark://ubuntu2:7077, you would
    replace <master>[SPARK MASTER URL]</master> in the example below with
    <master>spark://ubuntu2:7077</master>.
  - For Spark on YARN mode, specify yarn-client or yarn-cluster in the master
    element. For example, for yarn-cluster mode, you would replace
    <master>[SPARK MASTER URL]</master> with <master>yarn</master> and
    <mode>[SPARK MODE]</mode> with <mode>cluster</mode>.

  Here is an example of a Spark action within a workflow.xml file:

      <workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
          <start to='spark-node' />
          <action name='spark-node'>
              <spark xmlns="uri:oozie:spark-action:0.1">
                  <job-tracker>${jobTracker}</job-tracker>
                  <name-node>${nameNode}</name-node>
                  <master>[SPARK MASTER URL]</master>
                  <mode>[SPARK MODE]</mode>
                  <name>Spark-FileCopy</name>
                  <class>org.apache.oozie.example.SparkFileCopy</class>
                  <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
                  <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
                  <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output</arg>
              </spark>
              <ok to="end" />
              <error to="fail" />
          </action>
          <kill name="fail">
              <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
          </kill>
          <end name='end' />
      </workflow-app>
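A workflow like the one above is typically submitted with a job.properties file that supplies the ${jobTracker}, ${nameNode}, and ${examplesRoot} parameters. Every value in this sketch is illustrative and must match your cluster:

```
nameNode=maprfs:///
jobTracker=maprfs:///
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark
```

The workflow can then be submitted with the standard Oozie CLI, for example `oozie job -config job.properties -run`, adding `-oozie http://<oozie-host>:11000/oozie` if the OOZIE_URL environment variable is not set.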