Run Spark Jobs with Oozie

Prerequisites

IMPORTANT This component is deprecated. Hewlett Packard Enterprise recommends using an alternate product. For more information, see Discontinued Ecosystem Components.

Complete the following steps to configure Oozie to run Spark jobs:

Configure a Spark action:

Procedure

  1. To run a Spark action through Oozie on a secure cluster, Oozie must be able to connect to Hive. Make sure that the hive-site.xml file used by Oozie has the following property set:
    <property>
      <name>hive.metastore.sasl.enabled</name>
      <value>true</value>
    </property>
  2. To add Spark configuration files (such as spark-defaults.conf and hive-site.xml) to a Spark action, copy the files to the {OOZIE_HOME}/share/lib/spark/ directory.
  3. If needed, update the Oozie shared libraries as described in Updating the Oozie Shared Libraries.
  4. To run PySpark using a Spark action:
    1. Copy the pyspark and py4j zip files to the sharelib:
      cp /{SPARK_HOME}/python/lib/pyspark*.zip {OOZIE_HOME}/share/lib/spark/
      cp /{SPARK_HOME}/python/lib/py4j*src.zip {OOZIE_HOME}/share/lib/spark/
    2. Update the Oozie shared libraries as described in Updating the Oozie Shared Libraries.
  5. When you configure a Spark action in the workflow.xml, specify the master and mode elements of the Spark job:
    • For Spark standalone mode, specify the Spark Master URL in the master element. For example, if your Spark Master URL is spark://ubuntu2:7077, replace <master>[SPARK MASTER URL]</master> in the example below with <master>spark://ubuntu2:7077</master>.
    • For Spark on YARN mode, specify the YARN master in the master element and the deploy mode (client or cluster) in the mode element. For example, for yarn-cluster mode, replace <master>[SPARK MASTER URL]</master> with <master>yarn</master> and <mode>[SPARK MODE]</mode> with <mode>cluster</mode>.

      Here is an example of a Spark action within a workflow.xml file:

      <workflow-app xmlns="uri:oozie:workflow:0.5" name="SparkFileCopy">
        <start to="spark-node" />
        <action name="spark-node">
          <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>[SPARK MASTER URL]</master>
            <mode>[SPARK MODE]</mode>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output</arg>
          </spark>
          <ok to="end" />
          <error to="fail" />
        </action>
        <kill name="fail">
          <message>Workflow failed, error
                   message[${wf:errorMessage(wf:lastErrorNode())}]
          </message>
        </kill>
        <end name="end" />
      </workflow-app>
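
Two of the steps above lend themselves to a quick local check before you submit the workflow: confirming the hive-site.xml property from step 1, and filling in the master and mode placeholders from step 5. The following is a minimal shell sketch, not part of the product. The function names sasl_enabled and fill_spark_action are hypothetical, and the property check assumes that the <name> and <value> elements sit on consecutive lines, as they do in the snippet in step 1.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical and not part of Oozie.

# Succeeds (exit 0) if the given hive-site.xml sets
# hive.metastore.sasl.enabled to true.
# Assumption: <value> is on the line directly after <name>.
sasl_enabled() {
  local hive_site="$1"
  awk '
    /<name>hive\.metastore\.sasl\.enabled<\/name>/ {
      getline                                   # move to the next line
      if ($0 ~ /<value>true<\/value>/) found = 1
    }
    END { exit !found }                         # 0 when found, 1 otherwise
  ' "$hive_site"
}

# Prints a copy of a workflow file with the [SPARK MASTER URL] and
# [SPARK MODE] placeholders replaced by the given values.
# "|" is the sed delimiter so that URLs such as spark://host:7077 work;
# values containing "|" or "&" would need escaping first.
fill_spark_action() {
  local template="$1" master="$2" mode="$3"
  sed -e "s|\[SPARK MASTER URL]|${master}|" \
      -e "s|\[SPARK MODE]|${mode}|" "$template"
}
```

For example, assuming the example above was saved under the hypothetical name workflow.xml.template, running fill_spark_action workflow.xml.template yarn cluster > workflow.xml would produce a workflow.xml configured for yarn-cluster mode, and sasl_enabled {OOZIE_HOME}/share/lib/spark/hive-site.xml would report whether the copied file carries the required property.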