Spark 2.0.0 Developer Preview
Apache Spark is an open-source engine that you can use to process Hadoop data. Although MapR does not yet ship a Spark 2.0.0 package, you can install and use Spark 2.0.0 on a non-secure MapR 5.1 cluster or on a secure MapR 5.1 cluster that uses MapR-SASL authentication.
For information on the Developer Preview, go to mapr.com.
Installing Spark
Complete the following steps to install Spark 2.0.0 on a MapR 5.1 cluster:
- Go to the Apache Spark downloads page: http://spark.apache.org/downloads.html.
- Select the 2.0.0 (July 26 2016) release.
- Select the package type based on how you plan to use Spark.
- If you plan to use Hive with Spark, select Source Code.
- If you do not plan to use Spark with Hive, select Pre-built with user-provided Hadoop.
- Select the Direct Download method.
- Download the tarball to each node that you want to install Spark on.
- For the source code package, complete the following steps to build Spark.
- Select a folder to extract the source code and build Spark.
- Run the following commands to extract the source code and change to the source directory:
  tar -xvzf spark-2.0.0.tgz -C <selected_folder>
  cd <selected_folder>/spark-2.0.0/
- Update the following parameters in the <selected_folder>/spark-2.0.0/pom.xml file with new values:

  Parameter                  New Value
  curator.version            2.7.1
  hive.group                 org.apache.hive
  hive.version               1.2.0-mapr-1607
  hive.version.short         1.2.0
  datanucleus-core.version   4.1.6

- Add the MapR Repository to the <selected_folder>/spark-2.0.0/pom.xml file:
  <repository>
    <id>mapr-repo</id>
    <name>MapR Repository</name>
    <url>http://repository.mapr.com/maven/</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
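The version overrides above can also be scripted. A minimal sketch using sed, assuming the properties appear as simple one-line <name>value</name> elements in the pom's <properties> section; it runs against a throwaway scratch file here, but on a build node you would point pom at the real <selected_folder>/spark-2.0.0/pom.xml:

```shell
# Scratch pom fragment standing in for the real Spark pom.xml (hypothetical values)
pom=$(mktemp)
cat > "$pom" <<'EOF'
<properties>
  <curator.version>2.6.0</curator.version>
  <hive.version>1.2.1.spark2</hive.version>
</properties>
EOF

# Rewrite each property in place, as the table above requires
sed -i 's|<curator.version>.*</curator.version>|<curator.version>2.7.1</curator.version>|' "$pom"
sed -i 's|<hive.version>.*</hive.version>|<hive.version>1.2.0-mapr-1607</hive.version>|' "$pom"

# Show the updated values
grep -E 'curator.version|hive.version' "$pom"
```

The same sed pattern extends to the remaining properties in the table; verify the result with a diff before building.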
- In the <selected_folder>/spark-2.0.0 directory, run the following commands to change the Scala version and build Spark with Hive:
  ./dev/change-scala-version.sh 2.10
  ./dev/make-distribution.sh --tgz -Phadoop-provided -Pyarn -Phive -Phive-thriftserver -Dscala-2.10
- On each Spark node, create a Spark directory in the MapR installation directory:
  sudo mkdir /opt/mapr/spark
- On each Spark node, extract the tarball into the /opt/mapr/spark directory.
  - For the pre-built package type:
    tar -xvzf spark-2.0.0-bin-without-hadoop.tgz -C /opt/mapr/spark/
  - For the package that you built from source:
    tar -xvzf spark-2.0.0-bin-2.2.0.tgz -C /opt/mapr/spark/
- On each Spark node, make sure that the mapr user owns the files and folders in the new Spark directory:
  cd /opt/mapr/spark/
  sudo chown -R mapr:mapr spark-2.0.0*
- On each Spark node, create a spark-env.sh file from its template file in the /opt/mapr/spark/spark-2.0.0*/conf directory:
  cd /opt/mapr/spark/spark-2.0.0*/conf
  cp spark-env.sh.template spark-env.sh
- If you want to run Spark HistoryServer, complete the following steps on each Spark node:
- Create a spark-defaults.conf file from its template file in the /opt/mapr/spark/spark-2.0.0*/conf directory:
  cp spark-defaults.conf.template spark-defaults.conf
- Add the event log directory to the /opt/mapr/spark/spark-2.0.0*/conf/spark-defaults.conf file on each Spark node:
  spark.eventLog.dir maprfs:///apps/spark
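The event log directory must exist in MapR-FS before the History Server comes up. A sketch of the remaining steps, assuming the /apps/spark path configured above and the start script shipped in the Spark sbin directory; these are cluster commands to run on a node with a configured hadoop client:

```shell
# Create the event log directory in MapR-FS as the mapr user
sudo -u mapr hadoop fs -mkdir -p /apps/spark

# Start the History Server; it reads spark.eventLog.dir from spark-defaults.conf
sudo -u mapr /opt/mapr/spark/spark-2.0.0*/sbin/start-history-server.sh
```

By default the History Server web UI listens on port 18080.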
- On each Spark node, set the following properties in the /opt/mapr/spark/spark-2.0.0*/conf/spark-env.sh file:
  export SPARK_HOME=<set to spark home>
  export HADOOP_HOME=<set to hadoop home eg:/opt/mapr/hadoop/hadoop-2.7.0>
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  MAPR_HADOOP_CLASSPATH=`hadoop classpath`:/opt/mapr/lib/slf4j-log4j12-1.7.5.jar:
  MAPR_HADOOP_JNI_PATH=`hadoop jnipath`
  export SPARK_LIBRARY_PATH=$MAPR_HADOOP_JNI_PATH
  MAPR_SPARK_CLASSPATH="$MAPR_HADOOP_CLASSPATH"
  SPARK_DIST_CLASSPATH=$MAPR_SPARK_CLASSPATH
  # Security status
  source /opt/mapr/conf/env.sh
  if [ "$MAPR_SECURITY_STATUS" = "true" ]; then
    SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dmapr_sec_enabled=true"
  fi
NOTE: For the source code package, SPARK_HOME should be set to /opt/mapr/spark/spark-2.0.0-bin-2.2.0/. For the pre-built package, SPARK_HOME should be set to /opt/mapr/spark/spark-2.0.0-bin-without-hadoop.
- To configure standalone mode, complete these additional steps:
- On the SparkMaster node, copy the $SPARK_HOME/conf/slaves.template file to create $SPARK_HOME/conf/slaves:
  cp $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
- In the slaves file, add the hostnames of the SparkWorker nodes, one hostname per line. For example:
  localhost
  worker-node-1
  worker-node-2
- Set up passwordless ssh for the mapr user such that the SparkMaster node has access to all slave nodes defined in the conf/slaves file.
- On each Spark node, add the SparkMaster host and port to the conf/spark-defaults.conf file:
  spark.master spark://<hostname>:7077
- On the SparkMaster node, start the master and slave nodes by running the following command:
  $SPARK_HOME/sbin/start-all.sh
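Before submitting the test job, it is worth confirming that the standalone daemons came up. A sketch of two quick checks to run on the SparkMaster node, assuming the default Master web UI port (8080) and that the JDK's jps tool is on the PATH:

```shell
# The standalone master and workers run as JVM processes named "Master" and "Worker"
jps | grep -E 'Master|Worker'

# The Master web UI answers on port 8080 by default; expect HTTP 200
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080
```

If a worker is missing, check its log under $SPARK_HOME/logs on that node.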
- To test the installation, run the following command on a node where you have installed Spark 2.0.0:
  sudo -u mapr ./bin/spark-submit --class org.apache.spark.examples.DFSReadWriteTest \
    --master spark://<SparkNode>:7077 --files ./README.md \
    ./examples/jars/spark-examples_2.10-2.0.0-SNAPSHOT.jar ./README.md /user/mapr/
Configuring and Using Spark
For information on how to configure and use Spark, see MapR's Spark Standalone and Spark On YARN documentation. You can also see the Apache Spark 2.0.0 documentation. In general, MapR does not duplicate the documentation that is provided on the Apache site.
Integrating Spark with Other Ecosystems
You can use Spark 2.0.0 along with HBase 1.1-1602 and Hive 1.2.1-1607. After you complete the steps to install Spark, perform the steps to integrate Spark with each component that you want to use.
Integrate Spark with HBase
To integrate Spark with HBase, complete the following steps on each Spark node.
- Verify that the HBase RegionServer is installed on each Spark node.
- On each node where HBase is installed, add the following property to /opt/mapr/hbase/hbase-1.1.1/conf/hbase-site.xml:
  <property>
    <name>hbase.table.sanity.checks</name>
    <value>false</value>
  </property>
- Do one of the following:
- Copy the hbase-site.xml file from /opt/mapr/hbase/hbase-1.1.1/conf/ to $SPARK_HOME/conf/:
  cp $HBASE_HOME/conf/hbase-site.xml $SPARK_HOME/conf/
- Create a symbolic link to hbase-site.xml in $SPARK_HOME/conf/:
  ln -s $HBASE_HOME/conf/hbase-site.xml $SPARK_HOME/conf/
- In the $SPARK_HOME/conf/spark-env.sh file, append the HBase classpath to SPARK_DIST_CLASSPATH:
  MAPR_HBASE_CLASSPATH=`hbase classpath`
  SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$MAPR_HBASE_CLASSPATH"
Integrate Spark with Hive
To integrate Spark with Hive, complete the following steps on each Spark node.
- If you plan to use Spark Standalone with Hive, verify that Hive is installed on the Spark nodes from which you want to launch Spark jobs.
- Do one of the following:
- Copy the hive-site.xml file from /opt/mapr/hive/hive-1.2/conf/ to $SPARK_HOME/conf/:
  cp /opt/mapr/hive/hive-1.2/conf/hive-site.xml $SPARK_HOME/conf/
- Create a symbolic link to hive-site.xml in $SPARK_HOME/conf/:
  ln -s /opt/mapr/hive/hive-1.2/conf/hive-site.xml $SPARK_HOME/conf/
- Add the following property to $SPARK_HOME/conf/hive-site.xml:
  <property>
    <name>datanucleus.schema.autoCreateTables</name>
    <value>true</value>
  </property>
- In the $SPARK_HOME/conf/spark-env.sh file, add the following configurations:
  MAPR_HIVE_CLASSPATH="$(find /opt/mapr/hive/hive-1.2/lib/* -name '*.jar' -not -name '*derby*' -printf '%p:' | sed 's/:$//')"
  SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$MAPR_HIVE_CLASSPATH
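The find/sed expression above builds a colon-separated classpath from the Hive lib directory while excluding the Derby jars, which would otherwise conflict with the metastore configuration. A runnable sketch of the same expression against a throwaway directory (on a real node the command points at /opt/mapr/hive/hive-1.2/lib):

```shell
# Simulate a Hive lib directory containing a Derby jar that must be filtered out
libdir=$(mktemp -d)
touch "$libdir/hive-exec.jar" "$libdir/hive-metastore.jar" "$libdir/derby-10.10.2.0.jar"

# Same pattern as MAPR_HIVE_CLASSPATH: jar paths joined with ':', trailing colon stripped
cp="$(find "$libdir"/* -name '*.jar' -not -name '*derby*' -printf '%p:' | sed 's/:$//')"
echo "$cp"
```

Note that -printf is a GNU find extension, which is available on the Linux distributions MapR supports.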
- In the $SPARK_HOME/conf directory, create a spark-defaults.conf file from its template:
  cp spark-defaults.conf.template spark-defaults.conf
- In the $SPARK_HOME/conf/spark-defaults.conf file, add the following configurations:
  spark.sql.hive.metastore.version 1.2.1
  spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
  spark.yarn.dist.files /opt/mapr/hive/hive-1.2/lib/datanucleus-api-jdo-4.2.1.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-core-4.1.6.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-rdbms-4.1.7.jar,/opt/mapr/hive/hive-1.2/conf/hive-site.xml
  spark.executor.extraClassPath
- To test the integration between Spark and Hive, run one of the following commands:
- For Spark Standalone:
  bin/spark-submit --class org.apache.spark.examples.sql.hive.SparkHiveExample --master spark://<SparkNode>:7077 ./examples/jars/spark-examples_2.10-2.0.0.jar
- For Spark on YARN (yarn-client mode):
  bin/spark-submit --class org.apache.spark.examples.sql.hive.SparkHiveExample --master yarn --deploy-mode client ./examples/jars/spark-examples_2.10-2.0.0.jar
NOTE: The SparkHiveExample does not work in yarn-cluster mode.