Installing Custom Packages for PySpark Using Conda

To install custom packages for Python 2 (or Python 3) using Conda, create a custom Conda environment, archive it, and pass the path of the archive in your docker run command.

Prerequisites

To install Conda, follow the instructions at https://conda.io/docs/user-guide/install/index.html.

For each of the following steps, select the tab corresponding to the Python version you want to install.

Procedure

Create your custom Conda environment and archive it as a zip archive.
The following example creates a custom Conda environment with Python 2 and three packages (matplotlib, numpy, and pandas):
```
mkdir custom_pyspark_env
conda create -p ./custom_pyspark_env python=2 numpy pandas matplotlib
cd custom_pyspark_env
zip -r custom_pyspark_env.zip ./
```
The following example creates a custom Conda environment with Python 3 and three packages (matplotlib, numpy, and pandas):
```
mkdir custom_pyspark3_env
conda create -p ./custom_pyspark3_env python=3 numpy pandas matplotlib
cd custom_pyspark3_env
zip -r custom_pyspark3_env.zip ./
```
IMPORTANT: Do not create an archive named pyspark.zip. This name is reserved for PySpark internals.
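
Before uploading or mounting the archive, a quick sanity check can catch a reserved name or a badly rooted zip. The following is a minimal sketch, not part of the product: the function name `check_env_archive` is hypothetical, and the check for `bin/python` at the top level of the archive is an assumption based on the layout of a Conda environment on Linux zipped from inside its own directory, as in the examples above.

```python
import os
import zipfile

# Name reserved for PySpark internals, per the note above.
RESERVED_NAMES = {"pyspark.zip"}


def check_env_archive(path):
    """Return a list of problems found with a zipped Conda environment.

    An empty list means none of the checks below failed.
    """
    problems = []
    base = os.path.basename(path)
    if base in RESERVED_NAMES:
        problems.append("archive name is reserved: " + base)
    if not zipfile.is_zipfile(path):
        problems.append("not a readable zip archive")
        return problems
    with zipfile.ZipFile(path) as zf:
        # Tolerate entries stored with a leading "./".
        names = set(n[2:] if n.startswith("./") else n for n in zf.namelist())
    # Assumption: a Linux Conda environment zipped from inside its own
    # directory has the interpreter at bin/python.
    if "bin/python" not in names:
        problems.append("bin/python is not at the top level of the archive")
    return problems
```

Calling `check_env_archive("custom_pyspark_env.zip")` from the directory that holds the archive returns an empty list when none of these checks fail.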

Launch the Zeppelin container, specifying the path of the Python archive in your docker run command.

You can specify the archive in one of the following ways:

Option 1: Upload the archive to the MapR File System and specify its path there
Option 2: Specify the archive from your local file system using a Docker mount point

Option 1

hadoop fs -put custom_pyspark_env.zip /python_envs/custom_pyspark_env.zip
docker run -it ... \
   -e ZEPPELIN_ARCHIVE_PYTHON=/python_envs/custom_pyspark_env.zip \
   ... \
   maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7

Option 2

docker run -it ... \
   -v /local/path/custom_pyspark_env.zip:/tmp/custom_pyspark_env.zip:ro \
   -e ZEPPELIN_ARCHIVE_PYTHON=/tmp/custom_pyspark_env.zip \
   ... \
   maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7

The path parameters in the sample command correspond to the following:

| Full Path to Archive from Step 1 | Mount Point of the Archive in Your Container |
| --- | --- |
| `/local/path/custom_pyspark_env.zip` | `/tmp/custom_pyspark_env.zip` |

If you want to use Python 3 instead of Python 2, set ZEPPELIN_ARCHIVE_PYTHON in one of the following ways:

Option 1: Upload the archive to the MapR File System and specify its path there
Option 2: Specify the archive from your local file system using a Docker mount point

Option 1

hadoop fs -put custom_pyspark3_env.zip /python_envs/custom_pyspark3_env.zip
docker run -it ... \
   -e ZEPPELIN_ARCHIVE_PYTHON=/python_envs/custom_pyspark3_env.zip \
   ... \
   maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7

Option 2

docker run -it ... \
   -v /local/path/custom_pyspark3_env.zip:/tmp/custom_pyspark3_env.zip:ro \
   -e ZEPPELIN_ARCHIVE_PYTHON=/tmp/custom_pyspark3_env.zip \
   ... \
   maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7

The path parameters in the sample command correspond to the following:

| Full Path to Archive from Step 1 | Mount Point of the Archive in Your Container |
| --- | --- |
| `/local/path/custom_pyspark3_env.zip` | `/tmp/custom_pyspark3_env.zip` |
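
In Option 2, the container-side path must match in both the `-v` and `-e` arguments. The sketch below shows one way to derive both from a single local path so they cannot drift apart. The helper name `archive_mount_args` and its `container_dir` parameter are hypothetical, chosen for illustration. It produces only these two arguments; the remaining docker run arguments (the `...` in the samples) are not reproduced here.

```python
import posixpath


def archive_mount_args(local_archive, container_dir="/tmp"):
    """Build the -v and -e arguments for a locally stored archive.

    The archive is mounted read-only under container_dir, and
    ZEPPELIN_ARCHIVE_PYTHON is set to that same container path.
    """
    name = posixpath.basename(local_archive)
    container_path = posixpath.join(container_dir, name)
    return [
        "-v", "{}:{}:ro".format(local_archive, container_path),
        "-e", "ZEPPELIN_ARCHIVE_PYTHON={}".format(container_path),
    ]
```

For `/local/path/custom_pyspark3_env.zip`, the helper yields the same `-v` and `-e` values shown in the Python 3 sample command.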

To verify that you have successfully installed the matplotlib package, run one of the following code snippets in your Zeppelin UI, depending on whether you use the Livy or the Spark interpreter:

%livy.pyspark

import sys
print(sys.version)

import matplotlib
print(matplotlib.__version__)

%spark.pyspark 
import sys 
print(sys.version) 

import matplotlib 
print(matplotlib.__version__)

The code snippet returns output similar to the following. The first block is from a Python 2 environment, and the second is from a Python 3 environment:

2.7.14 |Anaconda, Inc.| (default, Oct 27 2017, 18:21:12) 
[GCC 7.2.0]
2.1.0

3.6.3 |Anaconda, Inc.| (default, Oct 27 2017, 19:41:01) 
[GCC 7.2.0]
2.1.0

The minor versions of Python and matplotlib may differ depending on the versions you install.
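
The snippets above check matplotlib only. To confirm all three packages from the example environment at once, a sketch along the following lines may help. The function name `installed_versions` is hypothetical, and falling back to "unknown" when a module has no `__version__` attribute is an illustrative choice. The code is written to run under both Python 2 and Python 3.

```python
import importlib


def installed_versions(package_names):
    """Map each package name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not present in the active environment.
            versions[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

In a Zeppelin paragraph, place the function under the `%livy.pyspark` or `%spark.pyspark` directive and print the result of `installed_versions(["matplotlib", "numpy", "pandas"])`. A value of None for a package suggests that the custom archive was not picked up or that the package is missing from the environment.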