Installing Custom Packages for PySpark Using Conda

To install custom packages for Python 2 (or Python 3) using Conda, you must create a custom Conda environment and pass the path of the custom environment in your docker run command.

Prerequisites

To install Conda, follow the instructions at https://conda.io/docs/user-guide/install/index.html.

For each step of the following steps, select the tab corresponding to the Python version you want to install.

Procedure

  1. Create your custom Conda environment and archive it as a zip archive.
    The following example creates a custom Conda environment with Python 2 and three packages (matplotlib, numpy, and pandas):
    mkdir custom_pyspark_env
    conda create -p ./custom_pyspark_env python=2 numpy pandas matplotlib
    cd custom_pyspark_env
    zip -r custom_pyspark_env.zip ./
    The following example creates a custom Conda environment with Python 3 and three packages (matplotlib, numpy, and pandas):
    mkdir custom_pyspark3_env
    conda create -p ./custom_pyspark3_env python=3 numpy pandas matplotlib
    cd custom_pyspark3_env
    zip -r custom_pyspark3_env.zip ./
    IMPORTANT Do not create an archive named pyspark.zip. This name is reserved for PySpark internals.
  2. Launch the Zeppelin container, specifying the path of the Python archive in your docker run command.

    You can specify the archive in one of the following ways:

    • Option 1: Specify the archive from the MapR File System by uploading the archive to the MapR File System
    • Option 2: Specify the archive from your local filesystem using a Docker mount point

    Option 1
    hadoop fs -put custom_pyspark_env.zip /python_envs/custom_pyspark_env.zip
    docker run -it ... \
       -e ZEPPELIN_ARCHIVE_PYTHON=/python_envs/custom_pyspark_env.zip \ 
       ... \
       maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7
    Option 2
    docker run -it ... \
       -v /local/path/custom_pyspark_env.zip:/tmp/custom_pyspark_env.zip:ro \
       -e ZEPPELIN_ARCHIVE_PYTHON=/tmp/custom_pyspark_env.zip \ 
       ... \
       maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7

    The path parameters in the sample command correspond to the following:

    Full Path to Archive from Step 1 Mount Point of the Archive in your Container
    /local/path/custom_pyspark_env.zip /tmp/custom_pyspark_env.zip

    If you want to use Python 3 instead of Python 2, set >>>>>>> Brought back DSR 1.3 content ZEPPELIN_ARCHIVE_PYTHON in one of the following ways:

    • Option 1: Specify the archive from MapR File System by uploading the archive to MapR File System
    • Option 2: Specify the archive from your local file system using a Docker mount point

    Option 1
    hadoop fs -put custom_pyspark3_env.zip /python_envs/custom_pyspark3_env.zip 
    docker run -it ... \ 
       -e ZEPPELIN_ARCHIVE_PYTHON=/python_envs/custom_pyspark3_env.zip \
       ... \
       maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7
    Option 2
    docker run -it ... \
       -v /local/path/custom_pyspark3_env.zip:/tmp/custom_pyspark3_env.zip:ro \
       -e ZEPPELIN_ARCHIVE_PYTHON=/tmp/custom_pyspark3_env.zip \ 
       ... \
       maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7

    The path parameters in the sample command correspond to the following:

    Full Path to Archive from Step 1 Mount Point of the Archive in your Container
    /local/path/custom_pyspark3_env.zip /tmp/custom_pyspark3_env.zip
  3. To verify that you have successfully installed the matplotlib package, run the following code snippet in your Zeppelin UI:
    %livy.pyspark
    
    import sys
    print(sys.version)
    
    import matplotlib
    print(matplotlib.__version__)
    %spark.pyspark 
    import sys 
    print(sys.version) 
    
    import matplotlib 
    print(matplotlib.__version__) 

    The code snippet returns output similar to the following:

    2.7.14 |Anaconda, Inc.| (default, Oct 27 2017, 18:21:12) 
    [GCC 7.2.0]
    2.1.0
    3.6.3 |Anaconda, Inc.| (default, Oct 27 2017, 19:41:01) 
    [GCC 7.2.0]
    2.1.0

    The minor versions of Python and matplotlib may differ depending on the versions you install.