Integrate Spark with HBase

Integrate Spark with HBase or MapR Database when you want to run Spark jobs on HBase or MapR Database tables.

About this task

If you installed Spark with the MapR Installer, these steps are not required.

Procedure

  1. Configure the HBase version in the /opt/mapr/spark/spark-<version>/mapr-util/compatibility.version file:
    hbase_versions=<version>
    The HBase version depends on the EEP and MapR versions that you are running.
  2. If you want to create HBase tables with Spark, add the following property to hbase-site.xml:
    <property>
    <name>hbase.table.sanity.checks</name>
    <value>false</value>
    </property>
  3. On each Spark node, copy the hbase-site.xml file to the SPARK_HOME/conf/ directory.
    TIP: Starting in the EEP 7.0.0 release, you do not have to complete this step. Running configure.sh copies the hbase-site.xml file to the Spark directory automatically.
  4. Specify the hbase-site.xml file in the SPARK_HOME/conf/spark-defaults.conf file:
    spark.yarn.dist.files SPARK_HOME/conf/hbase-site.xml
  5. To verify the integration, complete the following steps:
    1. Create an HBase or MapR Database table:
      create '<table_name>', '<column_family>'
    2. Run the following command as the mapr user or as a user that mapr impersonates:
      /opt/mapr/spark/spark-<spark_version>/bin/spark-submit --master <master> [--deploy-mode <deploy-mode>] --class org.apache.hadoop.hbase.spark.example.rdd.HBaseBulkPutExample /opt/mapr/hbase/hbase-<hbase_version>/lib/hbase-spark-<hbase_version>-mapr.jar <table_name> <column_family>
      The master URL for the cluster is one of spark://<host>:7077, yarn, or local. Omit --deploy-mode when the master is local; otherwise, the deploy mode is either client or cluster.
    3. Check the data in the HBase or MapR Database table:
      hbase(main):001:0> scan '<table_name>'
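
Steps 3 and 4 of the procedure can be scripted when you have several Spark nodes to configure. The following is a minimal sketch, not part of the MapR tooling: the function name link_hbase_site and its two arguments are hypothetical, and it assumes that the Spark conf directory already exists and that you pass the real installation path instead of the SPARK_HOME placeholder.

```shell
#!/bin/sh
# Sketch of steps 3 and 4. The function name and arguments are
# illustrative assumptions, not a MapR-provided utility.
#
# Usage: link_hbase_site <path-to-hbase-site.xml> <spark-home>
link_hbase_site() {
    hbase_site=$1
    spark_home=$2
    conf_dir="$spark_home/conf"
    defaults="$conf_dir/spark-defaults.conf"
    entry="spark.yarn.dist.files $conf_dir/hbase-site.xml"

    # Step 3: copy hbase-site.xml into the Spark conf directory.
    cp "$hbase_site" "$conf_dir/hbase-site.xml" || return 1

    # Step 4: add the property only if no spark.yarn.dist.files line
    # exists yet. An existing line is left untouched, so merge the
    # path into it by hand if it already lists other files.
    touch "$defaults"
    if ! grep -q '^spark\.yarn\.dist\.files' "$defaults"; then
        printf '%s\n' "$entry" >> "$defaults"
    fi
}
```

Call the function once per node with the location of your hbase-site.xml file and the Spark installation directory. Running it a second time does not add a duplicate entry. On EEP 7.0.0 and later, configure.sh already performs the copy, so only the spark-defaults.conf part is relevant there.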