Manually Installing Custom Packages for PySpark

Use the Python package manager, pip (or pip3 for PySpark3), to manually install custom packages on each node in your MapR Data Platform cluster. You need administrative access on your cluster nodes to install the packages.

Procedure

  1. Install the package manager by running the commands for your operating system. In each set, the first two commands install pip for Python 2 and the last two install pip for Python 3 (pip3):
    1. RedHat:
      sudo yum install -y python-devel python-setuptools
      sudo easy_install pip
      sudo yum install -y python34-devel python34-setuptools
      sudo easy_install-3.4 pip
    2. SLES:
      sudo zypper install -y python-devel python-setuptools
      sudo easy_install pip
      sudo zypper install -y python3-devel python3-setuptools
      sudo easy_install-3.4 pip
    3. Ubuntu:
      sudo apt-get install -y python-dev python-setuptools
      sudo easy_install pip
      sudo apt-get install -y python3-dev python3-setuptools
      sudo easy_install3 pip
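
    After installing, you can confirm which installer commands are available on a node. The following is a minimal sketch; the candidate command names are taken from the commands above and may differ on your distribution:

```python
# Sketch: report which Python package-manager commands are on PATH.
# The candidate names below mirror the install commands above and are
# assumptions; adjust them for your distribution.
import shutil

def available_installers(candidates=("pip", "pip3", "easy_install", "easy_install-3.4")):
    """Return the subset of installer commands found on this node's PATH."""
    return [cmd for cmd in candidates if shutil.which(cmd) is not None]

print(available_installers())
```

    Running the snippet on a node that is missing pip or pip3 makes the gap visible before you attempt the package install in the next step.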
  2. Install the custom package using the package manager (pip or pip3) that you installed in step 1.

    The following example installs the matplotlib package:

    sudo pip install matplotlib
    sudo pip3 install matplotlib

    You must install the package on each node in your MapR cluster where PySpark jobs will run. These are the nodes that contain a YARN NodeManager.
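
    Before running PySpark jobs, you can check on each node that the package is importable by the interpreter that will run them. The following sketch uses matplotlib to match the example above; substitute your own package name:

```python
# Sketch: verify that a required package is importable by this interpreter.
# Run it with the same Python (2 or 3) that your PySpark jobs use.
# "matplotlib" matches the example above and is an assumption; replace it
# with the package you installed.
import importlib.util
import sys

def package_installed(name):
    """Return True if the named package can be imported by this interpreter."""
    return importlib.util.find_spec(name) is not None

print(sys.version)
print(package_installed("matplotlib"))
```

    A False result on any NodeManager node means the pip install from step 2 still needs to run there.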

  3. To verify successful installs, run the following code snippet in your Zeppelin UI. Use the %livy.pyspark interpreter for Python 2 and %livy.pyspark3 for Python 3:
    %livy.pyspark
    
    import sys
    print(sys.version)
    
    import matplotlib
    print(matplotlib.__version__)

    The code snippet returns output similar to the following, shown here for the Python 2 and Python 3 interpreters respectively:

    2.7.5 (default, Nov 6 2016, 00:28:07) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
    2.1.0
    3.4.5 (default, May 29 2017, 15:17:55) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
    2.1.0

    The minor versions of Python and matplotlib may differ depending on the versions you install.