Understanding the MapR Database OJAI Connector for Spark

The MapR Database OJAI Connector for Spark enables you to build real-time and batch pipelines between your data and MapR Database JSON. Before getting started, make sure that you understand Spark terminology and workflow, system requirements and support, and OJAI connector and API features.

The MapR Database OJAI connector includes a set of APIs that enable you to write applications that consume MapR Database JSON tables and use them in Spark. The MapR Database OJAI Connector for Apache Spark is a companion to the MapR Database Binary Connector for Apache Spark, which provides the equivalent functionality for MapR Database Binary tables.

MapR Database OJAI Connector with Spark Workflow

You can use the MapR Database OJAI Connector to extract data from MapR Database or MapR File System, transform that data using either Spark or Spark SQL, and then load it into MapR Database JSON.
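
The extract, transform, and load steps described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the connector's definitive usage: it assumes the session object exposes `loadFromMapRDB` and `saveToMapRDB` methods (names not given in this section; check the connector's Python API reference for your EEP), and the function name, table paths, and column names (`age`, `name`) are hypothetical.

```python
def run_pipeline(spark, source_table, target_table):
    """Extract a JSON table, transform it with DataFrame operations, load the result."""
    # Extract: ASSUMED connector method on the Spark session; verify the name
    # and signature against the connector documentation for your EEP.
    df = spark.loadFromMapRDB(source_table)

    # Transform: standard Spark DataFrame operations.
    # The columns "age" and "name" are hypothetical; "_id" is the document key.
    adults = df.filter("age >= 18").select("_id", "name", "age")

    # Load: ASSUMED connector method that writes the DataFrame to a JSON table.
    spark.saveToMapRDB(adults, target_table)
    return adults
```

In a real job, `spark` would be a `SparkSession` created on a cluster where the connector is installed, and `source_table` and `target_table` would be MapR Database JSON table paths. Keeping the session as a parameter makes the pipeline easy to exercise with a stub object outside a cluster.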

MapR Database OJAI Connector for Apache Spark Features

Principal features of the MapR Database OJAI Connector for Apache Spark include the following:

  • Support for Scala and, beginning with EEP 4.1, Java and Python APIs
    This matrix shows the programming languages and features supported:

    Feature      Scala   Java   Python
    RDD          Yes     Yes    No
    DataFrame    Yes     Yes    Yes
    Dataset      Yes     Yes    No
    DStream      Yes     No     No

  • APIs that enable you to load data from a MapR Database JSON table to an Apache Spark RDD, DataFrame, or Dataset
  • Projection and filter pushdown for better performance
  • Custom partitioner for RDDs that enables you to partition data for better performance
  • APIs that save an Apache Spark RDD, DataFrame, or DStream to a MapR Database JSON table using either normal or bulk insert
  • Support for Scala and Java bean classes
  • Support for data locality
  • Support for secondary indexes, starting with EEP 7.0.0 and EEP 6.3.1
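
The language and feature matrix in the list above can be encoded as a small lookup, which may be handy when deciding which API to target from a given language. This is only a sketch: the values are transcribed from the matrix, while the names `SUPPORT`, `is_supported`, and `languages_for` are hypothetical and not part of the connector.

```python
# Support matrix transcribed from the list above: feature -> language -> supported.
SUPPORT = {
    "RDD":       {"Scala": True, "Java": True,  "Python": False},
    "DataFrame": {"Scala": True, "Java": True,  "Python": True},
    "Dataset":   {"Scala": True, "Java": True,  "Python": False},
    "DStream":   {"Scala": True, "Java": False, "Python": False},
}


def is_supported(feature, language):
    """Return True if the matrix lists `feature` as usable from `language`."""
    return SUPPORT[feature][language]


def languages_for(feature):
    """List the languages from which `feature` can be used, in matrix order."""
    return [lang for lang, ok in SUPPORT[feature].items() if ok]
```

For example, `is_supported("RDD", "Python")` evaluates to `False`, which reflects that Python applications are limited to the DataFrame API.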

The following features are not supported:

  • MapR Database Binary tables

    Only MapR Database JSON tables are supported; access to MapR Database Binary tables is provided through the MapR Database Binary Connector for Apache Spark.

  • Secondary indexes in EEP versions earlier than EEP 7.0.0 and EEP 6.3.1

Supported Product Versions and System Requirements

To use the MapR Database OJAI Connector for Apache Spark, you must have the following minimum software versions:

  • MapR 5.2.1 or later
  • EEP 3.0 or later
  • Spark 2.1.0 or later
  • Scala 2.11 or later
  • Java 8 or later

Support for DataFrames and Datasets is available starting in the EEP 4.0 release.
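
As an illustration, the minimum versions listed above could be checked with a small helper such as the following. This is a minimal sketch, not a tool shipped with the connector: the names `MINIMUMS`, `parse_version`, and `unmet_requirements` are hypothetical, and it assumes that versions are supplied as plain dotted numbers (for example "2.1.0" or "8"), so strings such as "1.8.0_292" would need to be normalized first.

```python
# Minimum versions from the list above, as integer tuples for easy comparison.
MINIMUMS = {
    "MapR": (5, 2, 1),
    "EEP": (3, 0),
    "Spark": (2, 1, 0),
    "Scala": (2, 11),
    "Java": (8,),
}


def parse_version(text):
    """Turn a dotted numeric version such as '2.1.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def unmet_requirements(installed):
    """Return the components that are missing or below the listed minimum.

    `installed` maps component names to dotted version strings.
    """
    unmet = []
    for component, minimum in MINIMUMS.items():
        version = installed.get(component)
        if version is None or parse_version(version) < minimum:
            unmet.append(component)
    return unmet
```

An empty list from `unmet_requirements` means every listed minimum is met. The check does not cover the separate EEP 4.0 threshold for DataFrame and Dataset support.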

OJAI API

The MapR Database OJAI Connector for Apache Spark uses the OJAI API internally to access MapR Database JSON tables.