Installing Custom Packages for PySpark Using Conda
To install custom packages for Python 2 (or Python 3) using Conda, you must create a custom Conda environment and pass the path of the custom environment in your docker run command.
Prerequisites
To install Conda, follow the instructions at https://conda.io/docs/user-guide/install/index.html.
For each of the following steps, select the tab corresponding to the Python version you want to install.
Procedure
-
Create your custom Conda environment and package it as a zip archive.
The following example creates a custom Conda environment with Python 2 and three packages (matplotlib, numpy, and pandas):

mkdir custom_pyspark_env
conda create -p ./custom_pyspark_env python=2 numpy pandas matplotlib
cd custom_pyspark_env
zip -r custom_pyspark_env.zip ./
The following example creates a custom Conda environment with Python 3 and three packages (matplotlib, numpy, and pandas):

mkdir custom_pyspark3_env
conda create -p ./custom_pyspark3_env python=3 numpy pandas matplotlib
cd custom_pyspark3_env
zip -r custom_pyspark3_env.zip ./
IMPORTANT: Do not create an archive named pyspark.zip. This name is reserved for PySpark internals.
-
Launch the Zeppelin container, specifying the path of the Python archive in your docker run command. You can specify the archive in one of the following ways:
- Option 1: Specify the archive from the MapR File System by first uploading the archive to the MapR File System
- Option 2: Specify the archive from your local file system using a Docker mount point
- Option 1

hadoop fs -put custom_pyspark_env.zip /python_envs/custom_pyspark_env.zip

docker run -it ... \
  -e ZEPPELIN_ARCHIVE_PYTHON=/python_envs/custom_pyspark_env.zip \
  ... \
  maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7
- Option 2

docker run -it ... \
  -v /local/path/custom_pyspark_env.zip:/tmp/custom_pyspark_env.zip:ro \
  -e ZEPPELIN_ARCHIVE_PYTHON=/tmp/custom_pyspark_env.zip \
  ... \
  maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7
The path parameters in the sample command correspond to the following:

Full Path to Archive from Step 1        Mount Point of the Archive in your Container
/local/path/custom_pyspark_env.zip      /tmp/custom_pyspark_env.zip
If you want to use Python 3 instead of Python 2, set ZEPPELIN_ARCHIVE_PYTHON in one of the following ways:
- Option 1: Specify the archive from the MapR File System by first uploading the archive to the MapR File System
- Option 2: Specify the archive from your local file system using a Docker mount point
- Option 1

hadoop fs -put custom_pyspark3_env.zip /python_envs/custom_pyspark3_env.zip

docker run -it ... \
  -e ZEPPELIN_ARCHIVE_PYTHON=/python_envs/custom_pyspark3_env.zip \
  ... \
  maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7
- Option 2

docker run -it ... \
  -v /local/path/custom_pyspark3_env.zip:/tmp/custom_pyspark3_env.zip:ro \
  -e ZEPPELIN_ARCHIVE_PYTHON=/tmp/custom_pyspark3_env.zip \
  ... \
  maprtech/data-science-refinery:v1.4.1_6.1.0_6.3.0_centos7
The path parameters in the sample command correspond to the following:

Full Path to Archive from Step 1        Mount Point of the Archive in your Container
/local/path/custom_pyspark3_env.zip     /tmp/custom_pyspark3_env.zip
-
To verify that you have successfully installed the matplotlib package, run one of the following code snippets in your Zeppelin UI, depending on whether you use the Livy or the Spark interpreter:

%livy.pyspark
import sys
print(sys.version)
import matplotlib
print(matplotlib.__version__)

%spark.pyspark
import sys
print(sys.version)
import matplotlib
print(matplotlib.__version__)
The code snippet returns output similar to the following.

For Python 2:
2.7.14 |Anaconda, Inc.| (default, Oct 27 2017, 18:21:12) [GCC 7.2.0]
2.1.0

For Python 3:
3.6.3 |Anaconda, Inc.| (default, Oct 27 2017, 19:41:01) [GCC 7.2.0]
2.1.0
The minor versions of Python and matplotlib may differ depending on the versions you install.
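The reserved-name rule from step 1 (never name the archive pyspark.zip) is easy to enforce mechanically in a build script. A minimal sketch follows; the ARCHIVE variable is illustrative, not part of the product:

```shell
# Illustrative guard: pyspark.zip is reserved for PySpark internals,
# so reject it as a custom-environment archive name before uploading.
ARCHIVE=custom_pyspark_env.zip
if [ "$(basename "$ARCHIVE")" = "pyspark.zip" ]; then
  echo "error: pyspark.zip is reserved by PySpark; choose another name" >&2
  exit 1
fi
echo "archive name OK: $ARCHIVE"
```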