Integrate Spark with Kafka

From EEP-5.0.0, Spark can be integrated with Kafka-1.0. You can configure a Spark application to produce Kafka messages.

About this task

NOTE Starting from EEP-8.0.0, HPE Ezmeral Data Fabric does not support spark-streaming-kafka-producer. To learn about Kafka integration on Apache Spark 3.1.2 and later in HPE Ezmeral Data Fabric, see Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).

Procedure

Add the following dependency:

groupId = org.apache.spark
artifactId = spark-streaming-kafka-producer_2.11
version = <spark_version>-mapr-<mapr_eco_version>
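
If you build with sbt, the coordinates above could be declared roughly as in the following sketch. The resolver name and URL are assumptions, because this topic does not state where the artifact is hosted. The version placeholders are kept as written and must be replaced with the values that match your cluster.

```scala
// build.sbt fragment (a sketch, not a complete build definition)

// Assumption: the artifact is resolved from the MapR Maven repository.
resolvers += "mapr-releases" at "https://repository.mapr.com/maven/"

// The Scala binary version (_2.11) is part of the artifactId, so use % rather than %%.
libraryDependencies +=
  "org.apache.spark" % "spark-streaming-kafka-producer_2.11" % "<spark_version>-mapr-<mapr_eco_version>"
```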

When you write the Spark program, import and use classes from:
```
org.apache.spark.streaming.kafka.producer._
org.apache.spark.streaming.dstream.DStream
```
The import of org.apache.spark.streaming.dstream.DStream adds the following method from DStream:
```
sendToKafka(topic: String, conf: ProducerConf)
```
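
A method that becomes available through an import is typically provided in Scala by an implicit enrichment. The following self-contained sketch illustrates that general pattern only. All names in it (FakeRDD, FakeProducerConf, FakeProducerSyntax) are hypothetical stand-ins, not the Spark or HPE Ezmeral Data Fabric classes, and the sketch prints messages instead of contacting a broker.

```scala
// Hypothetical stand-ins for illustration; these are NOT the real library classes.
final case class FakeProducerConf(bootstrapServers: List[String])
final class FakeRDD[T](val elements: Seq[T])

object FakeProducerSyntax {
  // Importing FakeProducerSyntax._ brings this implicit class into scope,
  // which makes sendToKafka callable on any FakeRDD[String].
  implicit class KafkaOps(rdd: FakeRDD[String]) {
    // Prints each message instead of sending it, and returns the message count.
    def sendToKafka(topic: String, conf: FakeProducerConf): Int = {
      val brokers = conf.bootstrapServers.mkString(",")
      rdd.elements.foreach(msg => println(s"[$topic @ $brokers] $msg"))
      rdd.elements.size
    }
  }
}

object EnrichmentDemo {
  import FakeProducerSyntax._ // without this import, sendToKafka does not compile

  def main(args: Array[String]): Unit = {
    val conf = FakeProducerConf(List("broker1:9092"))
    val sent = new FakeRDD(Seq("a", "b")).sendToKafka("demo-topic", conf)
    println(s"sent $sent messages")
  }
}
```

Removing the import in EnrichmentDemo makes the sendToKafka call fail to compile, which mirrors why the imports listed in this procedure are required before the method can be used.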

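In the following code, calling sendToKafka sends numMessages messages to the topics specified by the topics parameter:

```
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{ConstantInputDStream, DStream}
import org.apache.spark.streaming.kafka.producer._

// ssc, kafkaBrokers, topics, numMessages, and Item are defined elsewhere in the program.
val producerConf = new ProducerConf(
  bootstrapServers = kafkaBrokers.split(",").toList)

// Build a fixed set of messages and replay it on every streaming interval.
val items = (0 until numMessages.toInt).map(i => Item(i, i).toString)
val defaultRDD: RDD[String] = ssc.sparkContext.parallelize(items)
val dStream: DStream[String] = new ConstantInputDStream[String](ssc, defaultRDD)

// Send each batch to Kafka, then print the number of messages in the batch.
dStream.foreachRDD(_.sendToKafka(topics, producerConf))
dStream.count().print()
```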

Example

Source code for a sample producer program can be found at https://github.com/mapr/spark/blob/2.2.1-mapr-1803/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaProducerExample.scala