HPE Ezmeral Data Fabric 6.1.x is In Maintenance and transitions to "End of Maintenance" in June 2024. Please see the latest documentation.

Loading Data from MapR Database as an Apache Spark DataFrame

To load data from a MapR Database JSON table into an Apache Spark DataFrame, invoke the loadFromMapRDB API. The method signature and a sample call are shown for Scala, Java, and Python.

In Scala, apply the following method on a SparkSession object:

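def loadFromMapRDB[T](tableName: String,
                      schema: StructType): DataFrame

import org.apache.spark.sql.DataFrame
import com.mapr.db.spark.sql._

// sparkSession is an existing SparkSession; T is a placeholder for your application class
val df: DataFrame = sparkSession.loadFromMapRDB[T]("/tmp/user_profiles")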

In Java, a DataFrame is a Dataset of Row. Apply the following method on a MapRDBJavaSession object:

def loadFromMapRDB(tableName: String, schema: StructType, sampleSize: Double): DataFrame

import com.mapr.db.spark.sql.api.java.MapRDBJavaSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// spark is an existing SparkSession
MapRDBJavaSession maprSession = new MapRDBJavaSession(spark);
Dataset<Row> df = maprSession.loadFromMapRDB("/tmp/user_profiles");

NOTE Java supports only Datasets of Row (Dataset<Row>).

In Python, apply the following method on a SparkSession object:

loadFromMapRDB(table_name, schema, sample_size)

from pyspark.sql import SparkSession

# spark is an existing SparkSession
df = spark.loadFromMapRDB("/tmp/user_profiles")

NOTE PySpark supports only DataFrames (Dataset<Row>).

NOTE The only required parameter of these methods is tableName. All other parameters are optional.

Each call creates a DataFrame object corresponding to the MapR Database table specified by the tableName parameter.

Both DataFrames and MapR Database tables work with structured data. DataFrames need a fixed schema, whereas MapR Database allows for a flexible schema. When loading data into a DataFrame, you can map your data to a schema by specifying the schema parameter in the loadFromMapRDB call. You can also provide an application class as the type [T] parameter in the call. These two approaches are the preferred methods for loading data into DataFrames.
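The difference between a flexible document schema and a fixed DataFrame schema can be illustrated without a cluster. The following is a minimal pure-Python sketch, not the connector's implementation: USER_SCHEMA, to_row, and the sample documents are hypothetical names invented here, and raising an error on a type mismatch is this sketch's choice rather than documented connector behavior.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical fixed schema: (field name, Python type) pairs standing in
# for a Spark StructType. Not part of any MapR or Spark API.
USER_SCHEMA: List[Tuple[str, type]] = [
    ("_id", str),
    ("first_name", str),
    ("age", int),
]


def to_row(doc: Dict[str, Any], schema: List[Tuple[str, type]]) -> Tuple[Any, ...]:
    """Project one flexible-schema document onto a fixed schema."""
    row = []
    for name, expected in schema:
        value = doc.get(name)  # a field missing from the document becomes None
        if value is not None and not isinstance(value, expected):
            # Assumption of this sketch: a mismatched type is an error.
            raise TypeError(
                f"field {name!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        row.append(value)
    return tuple(row)


# Documents in the same table may carry different fields.
docs = [
    {"_id": "u1", "first_name": "Ana", "age": 34, "nickname": "an"},
    {"_id": "u2", "first_name": "Raj"},
]

# Every row has exactly the schema's columns, in the schema's order.
rows = [to_row(d, USER_SCHEMA) for d in docs]
```

Fields that are not in the schema (nickname) are dropped, and fields that a document lacks (age for u2) appear as None, so every row has the same shape.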

For data exploration use cases, you might not know the schema of your MapR Database table. In those situations, the MapR Database OJAI Connector for Apache Spark can infer the schema by sampling data from the table.
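The general idea of inferring a schema from a sample can be sketched in plain Python. This is an illustration under stated assumptions, not the connector's algorithm: infer_schema, its fraction parameter, and the rule that conflicting types fall back to a string are all hypothetical, and fraction is not claimed to match the semantics of the connector's sampleSize parameter.

```python
import random
from typing import Any, Dict, List


def infer_schema(docs: List[Dict[str, Any]], fraction: float = 1.0,
                 seed: int = 0) -> Dict[str, str]:
    """Infer a field-name -> type-name mapping from a random sample of docs."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not docs:
        return {}
    rng = random.Random(seed)
    # Sample at least one document and never more than are available.
    k = min(len(docs), max(1, int(len(docs) * fraction)))
    schema: Dict[str, str] = {}
    for doc in rng.sample(docs, k):
        for name, value in doc.items():
            type_name = type(value).__name__
            seen = schema.get(name)
            if seen is None:
                schema[name] = type_name
            elif seen != type_name:
                # Assumption of this sketch: conflicting types widen to string.
                schema[name] = "str"
    return schema


docs = [
    {"_id": "u1", "age": 34},
    {"_id": "u2", "age": "unknown", "tags": ["a"]},
    {"_id": "u3", "age": 41},
]

full_schema = infer_schema(docs, fraction=1.0)      # looks at every document
partial_schema = infer_schema(docs, fraction=0.34)  # looks at one document
```

With every document sampled, age maps to "str" because the table mixes integer and string values for that field. With a single document sampled, the result can miss fields such as tags entirely, which is the usual trade-off of sampling-based inference and a reason to prefer an explicit schema when one is known.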

Whenever possible, the MapR Database OJAI Connector for Apache Spark pushes projections and filters down to MapR Database for better performance. This allows MapR Database to project and filter data before returning it to your client application, so less data crosses the network.
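The effect of pushdown can be shown with a toy model. The sketch below is pure Python and purely illustrative: ToyTable, scan, and values_sent are hypothetical names that stand in for a server-side table and a count of field values returned to the client. It does not use or describe the connector's actual API.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Doc = Dict[str, Any]


class ToyTable:
    """Hypothetical in-memory stand-in for a table that lives on a server."""

    def __init__(self, docs: Iterable[Doc]) -> None:
        self._docs = list(docs)
        self.values_sent = 0  # field values "returned to the client"

    def scan(self, fields: Optional[List[str]] = None,
             condition: Optional[Callable[[Doc], bool]] = None) -> List[Doc]:
        """Return documents, applying the filter and projection server-side."""
        out = []
        for doc in self._docs:
            if condition is not None and not condition(doc):
                continue  # filtered out before anything is sent
            if fields is None:
                projected = dict(doc)
            else:
                projected = {f: doc[f] for f in fields if f in doc}
            self.values_sent += len(projected)
            out.append(projected)
        return out


table = ToyTable([
    {"_id": "u1", "name": "Ana", "age": 25},
    {"_id": "u2", "name": "Raj", "age": 41},
    {"_id": "u3", "name": "Mei", "age": 37},
])

# Without pushdown: fetch everything, then filter and project on the client.
everything = table.scan()
client_side = [{"name": d["name"]} for d in everything if d["age"] > 30]
sent_without_pushdown = table.values_sent

# With pushdown: the table applies the same filter and projection itself.
table.values_sent = 0
server_side = table.scan(fields=["name"], condition=lambda d: d["age"] > 30)
sent_with_pushdown = table.values_sent
```

Both approaches produce the same two rows, but the first returns all nine field values to the client while the second returns only two.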

The following subtopics describe these techniques.