Spark 2.0.0 Developer Preview

Apache Spark is an open-source processing engine that you can use to process Hadoop data. Although MapR does not yet ship a Spark 2.0.0 package, you can install and use Spark 2.0.0 on a non-secure MapR 5.1 cluster or on a secure MapR 5.1 cluster that uses MapR-SASL authentication.

NOTE: The installation and integration steps must be performed on all Spark nodes unless otherwise noted. Instead of repeating the steps on each node, you can perform the steps on one node and then copy the entire Spark directory to each node where you want Spark installed.
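For example, after configuring Spark on one node, you might distribute the directory with a loop such as the following. This is only a sketch: it assumes passwordless ssh for the user running the copy, rsync available on all nodes, and placeholder hostnames (worker-node-1, worker-node-2) that you replace with your own Spark node names.

  # Hypothetical example: copy a configured Spark directory to the other Spark nodes.
  # Replace the placeholder hostnames with your own node names.
  for node in worker-node-1 worker-node-2; do
    rsync -a /opt/mapr/spark/ ${node}:/opt/mapr/spark/
  done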

For information on the Developer Preview, go to mapr.com.

Installing Spark

Complete the following steps to install Spark 2.0.0 on a MapR 5.1 cluster:

  1. Go to the Apache Spark downloads page: http://spark.apache.org/downloads.html.
  2. Select the 2.0.0 (July 26 2016) release.
  3. Select the package type based on how you plan to use Spark.
    • If you plan to use Hive with Spark, select Source Code.
    • If you do not plan to use Spark with Hive, select Pre-built with user-provided Hadoop.
  4. Select the Direct Download method.

  5. Download the tarball to each node that you want to install Spark on.
  6. For the source code package, complete the following steps to build Spark.
    1. Select a folder in which to extract the source code and build Spark.
    2. Run the following commands to extract the source code and change to the source directory:
      tar -xvzf spark-2.0.0.tgz -C <selected_folder>
      cd <selected_folder>/spark-2.0.0/
    3. Update the following parameters in the <selected_folder>/spark-2.0.0/pom.xml file with the new values:
      Parameter                  New Value
      curator.version            2.7.1
      hive.group                 org.apache.hive
      hive.version               1.2.0-mapr-1607
      hive.version.short         1.2.0
      datanucleus-core.version   4.1.6
    4. Add the MapR Repository to the <selected_folder>/spark-2.0.0/pom.xml file:
      <repository>
        <id>mapr-repo</id>
        <name>MapR Repository</name>
        <url>http://repository.mapr.com/maven/</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </repository>
      
    5. In the <selected_folder>/spark-2.0.0 directory, run the following commands to change the Scala version and build Spark with Hive:
      ./dev/change-scala-version.sh 2.10
      ./dev/make-distribution.sh --tgz -Phadoop-provided -Pyarn -Phive -Phive-thriftserver -Dscala-2.10
  7. On each Spark node, create a Spark directory in the MapR Installation directory.
    sudo mkdir /opt/mapr/spark 
  8. On each Spark node, extract the tarball into the /opt/mapr/spark directory.
    • For the pre-built package type:
      tar -xvzf spark-2.0.0-bin-without-hadoop.tgz -C /opt/mapr/spark/
    • For the package that you built from source in step 6:
      tar -xvzf spark-2.0.0-bin-2.2.0.tgz -C /opt/mapr/spark/
  9. On each Spark node, make sure that the mapr user owns the files and folders in the new Spark directory.
    cd /opt/mapr/spark/
    sudo chown -R mapr:mapr spark-2.0.0* 
    
  10. On each Spark node, create a spark-env.sh file from its template file in the /opt/mapr/spark/spark-2.0.0*/conf directory.
    cd /opt/mapr/spark/spark-2.0.0*/conf
    cp spark-env.sh.template spark-env.sh 
    
  11. If you want to run Spark HistoryServer, complete the following steps on each Spark node:
    1. Create a spark-defaults.conf file from its template file in the /opt/mapr/spark/spark-2.0.0*/conf directory.
      
      cp spark-defaults.conf.template spark-defaults.conf 
      
    2. Add the event log directory to the /opt/mapr/spark/spark-2.0.0*/conf/spark-defaults.conf file. (A sketch for creating this directory and starting the History Server follows the installation steps.)
      spark.eventLog.dir         maprfs:///apps/spark
  12. On each Spark node, set the following properties in the /opt/mapr/spark/spark-2.0.0*/conf/spark-env.sh file:
    export SPARK_HOME=<set to spark home>
    export HADOOP_HOME=<set to hadoop home eg:/opt/mapr/hadoop/hadoop-2.7.0>
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 
    MAPR_HADOOP_CLASSPATH=`hadoop classpath`:/opt/mapr/lib/slf4j-log4j12-1.7.5.jar:
    MAPR_HADOOP_JNI_PATH=`hadoop jnipath`
    export SPARK_LIBRARY_PATH=$MAPR_HADOOP_JNI_PATH
    MAPR_SPARK_CLASSPATH="$MAPR_HADOOP_CLASSPATH"
    SPARK_DIST_CLASSPATH=$MAPR_SPARK_CLASSPATH 
    # Security status
    source /opt/mapr/conf/env.sh
    if [ "$MAPR_SECURITY_STATUS" = "true" ]; then   
    SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dmapr_sec_enabled=true"
    fi
    NOTE: For the source code package, SPARK_HOME should be set to /opt/mapr/spark/spark-2.0.0-bin-2.2.0/. For the pre-built package, SPARK_HOME should be set to /opt/mapr/spark/spark-2.0.0-bin-without-hadoop.
  13. To configure standalone mode, complete these additional steps:
    1. On the SparkMaster node, copy the $SPARK_HOME/conf/slaves.template file to create $SPARK_HOME/conf/slaves.
      cp $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
    2. In the slaves file, add the hostnames of the SparkWorker nodes. Put one worker node hostname on each line. For example:
      localhost
      worker-node-1
      worker-node-2
    3. Set up passwordless ssh for the mapr user such that the SparkMaster node has access to all slave nodes defined in the conf/slaves file.
    4. On each Spark node, add the SparkMaster host and port to the conf/spark-defaults.conf file.
      spark.master    spark://<hostname>:7077 
    5. On the SparkMaster node, start the master and slave nodes by running the following command:
      $SPARK_HOME/sbin/start-all.sh 
  14. To test the installation, run the following command on a node where you have installed Spark 2.0.0:
    sudo -u mapr ./bin/spark-submit --class org.apache.spark.examples.DFSReadWriteTest \
        --master spark://<SparkNode>:7077 --files ./README.md \
        ./examples/jars/spark-examples_2.10-2.0.0.jar ./README.md /user/mapr/
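The numbered steps above do not cover starting the History Server itself. The following is a minimal sketch for the optional History Server setup in step 11, assuming the maprfs:///apps/spark event log location from that step and that SPARK_HOME is set in your shell as described in step 12:

  # Create the event log directory in MapR-FS (path from step 11)
  sudo -u mapr hadoop fs -mkdir -p /apps/spark
  # Start the Spark History Server on the node where you want it to run
  sudo -u mapr $SPARK_HOME/sbin/start-history-server.sh

Note that applications write event logs only when spark.eventLog.enabled is set to true in spark-defaults.conf, and the History Server reads from the location given by spark.history.fs.logDirectory, which you may also need to point at the same maprfs:///apps/spark path.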

Configuring and Using Spark

For information on how to configure and use Spark, see MapR's Spark Standalone and Spark On YARN documentation. You can also see the Apache Spark 2.0.0 documentation. In general, MapR does not duplicate the documentation that is provided on the Apache site.

Integrating Spark with Other Ecosystems

You can use Spark 2.0.0 along with HBase 1.1.1-1602 and Hive 1.2.1-1607. After you complete the steps to install Spark, perform the steps to integrate Spark with each component that you want to use.

Integrate Spark with HBase

To integrate Spark with HBase, complete the following steps on each Spark node.

  1. Verify that the HBase RegionServer is installed on each Spark node.
  2. On each node where HBase is installed, add the following property to /opt/mapr/hbase/hbase-1.1.1/conf/hbase-site.xml:
    <property>
      <name>hbase.table.sanity.checks</name>
      <value>false</value>
    </property>
  3. Do one of the following:
    • Copy the hbase-site.xml file from /opt/mapr/hbase/hbase-1.1.1/conf/ to $SPARK_HOME/conf/.
      cp $HBASE_HOME/conf/hbase-site.xml $SPARK_HOME/conf/
    • Create a symbolic link to hbase-site.xml in $SPARK_HOME/conf/:
      ln -s $HBASE_HOME/conf/hbase-site.xml $SPARK_HOME/conf/
  4. In the $SPARK_HOME/conf/spark-env.sh file, append the HBase classpath to SPARK_DIST_CLASSPATH (a verification sketch follows these steps):
    MAPR_HBASE_CLASSPATH=`hbase classpath`
    SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$MAPR_HBASE_CLASSPATH"
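As an optional sanity check after completing these steps, you can confirm that the hbase classpath command used in spark-env.sh resolves and that the HBase configuration is visible to Spark. This is a sketch, assuming SPARK_HOME is exported in your shell:

  # Confirm that the hbase classpath command resolves to the installed HBase jars
  hbase classpath | tr ':' '\n' | head
  # Confirm that Spark can see the copied (or linked) HBase configuration
  ls -l $SPARK_HOME/conf/hbase-site.xml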

Integrate Spark with Hive

To integrate Spark with Hive, complete the following steps on each Spark node.

  1. If you plan to use Spark Standalone with Hive, verify that Hive is installed on the Spark nodes from which you want to launch Spark jobs.
  2. Do one of the following:
    • Copy the hive-site.xml file from /opt/mapr/hive/hive-1.2/conf/ to $SPARK_HOME/conf/.
      cp /opt/mapr/hive/hive-1.2/conf/hive-site.xml $SPARK_HOME/conf/
    • Create a symbolic link to hive-site.xml in $SPARK_HOME/conf/:
      ln -s /opt/mapr/hive/hive-1.2/conf/hive-site.xml $SPARK_HOME/conf/
  3. Add the following property to $SPARK_HOME/conf/hive-site.xml:
    <property>
      <name>datanucleus.schema.autoCreateTables</name>
      <value>true</value>
    </property>
  4. In the $SPARK_HOME/conf/spark-env.sh file, add the following configurations:
    MAPR_HIVE_CLASSPATH="$(find /opt/mapr/hive/hive-1.2/lib/* -name '*.jar' -not -name '*derby*' -printf '%p:' | sed 's/:$//')"
    SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$MAPR_HIVE_CLASSPATH
  5. In the $SPARK_HOME/conf directory, create a spark-defaults.conf file.
    cp spark-defaults.conf.template spark-defaults.conf
  6. In the $SPARK_HOME/conf/spark-defaults.conf file, add the following configurations:
    spark.sql.hive.metastore.version         1.2.1
    spark.sql.hive.metastore.sharedPrefixes  com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
    spark.yarn.dist.files                    /opt/mapr/hive/hive-1.2/lib/datanucleus-api-jdo-4.2.1.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-core-4.1.6.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-rdbms-4.1.7.jar,/opt/mapr/hive/hive-1.2/conf/hive-site.xml
    spark.executor.extraClassPath
  7. To test the integration between Spark and Hive, run one of the following commands (an additional quick check follows these steps):
    • For Spark Standalone:
      bin/spark-submit --class org.apache.spark.examples.sql.hive.SparkHiveExample --master spark://<SparkNode>:7077 ./examples/jars/spark-examples_2.10-2.0.0.jar
    • For Spark on YARN (yarn-client mode):
      bin/spark-submit --class org.apache.spark.examples.sql.hive.SparkHiveExample --master yarn --deploy-mode client ./examples/jars/spark-examples_2.10-2.0.0.jar
      NOTE: The SparkHiveExample does not work in yarn-cluster mode.
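As an alternative quick check, you can run a simple query through the spark-sql shell. This is a sketch, assuming your Spark build includes Hive support (the -Phive profile from the source-build steps), SPARK_HOME is set in your shell, and your metastore contains at least the default database:

  # Optional sanity check: list Hive databases through Spark SQL
  sudo -u mapr $SPARK_HOME/bin/spark-sql -e "SHOW DATABASES;"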