Configuring the Spark Interpreter
The Spark interpreter is available starting in the 1.1 release of the MapR Data Science Refinery. It provides support for Spark Python, SparkR, basic Spark, and Spark SQL jobs. Using the Spark interpreter for each of these variations requires some setup, including configuring Zeppelin and installing software on your MapR cluster.
You must also issue your docker run command with the parameters that the Spark interpreter requires. See the following sections for details about these parameters.
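As a rough sketch, such a command might look like the following. The image name, environment variable names, and values shown here are illustrative assumptions, not the authoritative parameter list; substitute the values for your own cluster.

# Illustrative only: parameter names and values are assumptions.
docker run -it \
  -e MAPR_CLUSTER=my.cluster.com \
  -e MAPR_CLDB_HOSTS=cldb-host \
  -e MAPR_CONTAINER_USER=mapr \
  -e HOST_IP=10.10.1.10 \
  -p 9995:9995 \
  maprtech/data-science-refinery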
Spark Python
You must install Python on your MapR cluster nodes to run Python code with the Spark interpreter. You do not need to install it in your container, because the Python code in Spark jobs executes on the cluster, not in the container.
%spark.pyspark
By default, this invokes Python 2. To switch Python versions, see Python Version.
To install custom Python packages, see Installing Custom Packages for PySpark. This also describes how to use Python 3 with custom packages.
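For example, a paragraph like the following runs under %spark.pyspark. This is a minimal sketch that assumes Spark 2.x, where Zeppelin exposes the spark SparkSession to the paragraph; the DataFrame contents are arbitrary.

%spark.pyspark
# Build a small DataFrame and display it; `spark` is the session
# Zeppelin provides to the paragraph (assumes Spark 2.x).
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df.show()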
SparkR
The Zeppelin container includes R. However, some Apache SparkR jobs also require R to be installed on your MapR cluster nodes before the Spark interpreter can run them.
%spark.r
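For example, the following paragraph is a minimal sketch of a SparkR job, assuming the Spark 2.x SparkR API, in which createDataFrame converts a local R data.frame:

%spark.r
# Minimal sketch: convert a local R data.frame to a SparkDataFrame
# and print its schema.
localDF <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
df <- createDataFrame(localDF)
printSchema(df)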
Spark Jobs
By default, the Spark interpreter is configured to submit Apache Spark jobs in YARN client mode. The interpreter does not support YARN cluster mode. Make sure you follow the steps described at Installing Spark on YARN to install Spark on your MapR cluster.
To run Spark jobs in parallel, you must modify the Spark interpreter setting to instantiate the interpreter Per Note, using either the scoped or the isolated option, as sketched below.
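You normally make this change on the Interpreter page of the Zeppelin UI. For reference, the equivalent stanza in Zeppelin's conf/interpreter.json looks roughly like the following sketch; the choice of scoped here is illustrative, and isolated is the other per-note option.

"option": {
  "perNote": "scoped",
  "perUser": "shared"
}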
Hive Tables
To access Apache Hive tables using the Spark interpreter, you must make the hive-site.xml configuration file from your Hive cluster available to Spark running in your Zeppelin container. Follow the same steps that describe how to access Hive tables with the Livy interpreter.
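Once hive-site.xml is visible to Spark in the container, notebook paragraphs can query Hive tables through the Spark SQL interpreter. The table name in this sketch is hypothetical:

%spark.sql
-- `web_logs` is a hypothetical Hive table name.
SELECT * FROM web_logs LIMIT 10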