Configuring Conda Python for Zeppelin

Describes how to configure Conda Python for Zeppelin.

The following steps assume that the Miniconda distribution of Conda Python is already installed. For more information, see the Conda documentation.

Use these steps:
  1. Create a Conda zip archive containing Python and all the libraries that you need.
    The following example creates a custom Conda environment with Python 2 and three packages (matplotlib, numpy, and pandas):
    mkdir custom_pyspark_env
    conda create -p ./custom_pyspark_env python=2 numpy pandas matplotlib
    cd custom_pyspark_env
    zip -r custom_pyspark_env.zip ./
    The following example creates a custom Conda environment with Python 3 and three packages (matplotlib, numpy, and pandas):
    mkdir custom_pyspark3_env
    conda create -p ./custom_pyspark3_env python=3 numpy pandas matplotlib
    cd custom_pyspark3_env
    zip -r custom_pyspark3_env.zip ./
    IMPORTANT: Do not create an archive named pyspark.zip. This name is reserved for PySpark internals.
  2. Upload the archive to the data-fabric file system. For example, if the archive name is custom_pyspark_env.zip, and you want to put the archive in a directory that all users can read:
    hadoop fs -mkdir /apps/zeppelin
    hadoop fs -put custom_pyspark_env.zip /apps/zeppelin
  3. Add the full path to the archive (including the maprfs:// scheme) to the spark.yarn.dist.archives property, and configure the Spark or Livy interpreter to use Python from this distribution.

    Note that every archive listed in this property is extracted into the working directory of the YARN application.

  4. For the Spark interpreter, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to the Python executable from this distribution. You can set them in the Spark interpreter configuration.
  5. For the Livy interpreter, set the livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON property to the Python executable from this distribution.
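
As a rough illustration of steps 3 through 5, the following sketch derives candidate values from the archive name and upload directory used in the earlier example. Several things here are assumptions rather than documented values: the variable names (ARCHIVE_NAME, ARCHIVE_DIR, ARCHIVE_URI, PYTHON_PATH) are hypothetical, the archive is assumed to be extracted into a directory named after the archive file, and the Conda environment is assumed to keep its interpreter at bin/python. Verify each value against your own cluster before using it.

```shell
#!/bin/sh
# Sketch only: derive candidate settings for steps 3-5.
# ARCHIVE_NAME and ARCHIVE_DIR match the upload example; change them to fit
# your own archive and directory.
ARCHIVE_NAME="custom_pyspark_env.zip"
ARCHIVE_DIR="/apps/zeppelin"

# Step 3: full path to the archive, including the maprfs:// scheme.
ARCHIVE_URI="maprfs://${ARCHIVE_DIR}/${ARCHIVE_NAME}"

# Assumption: YARN extracts the archive into a directory named after the
# archive file, and the Conda environment keeps its interpreter in bin/python.
PYTHON_PATH="./${ARCHIVE_NAME}/bin/python"

# Print the settings as key=value pairs to copy into the interpreter
# configuration. Nothing is written to Zeppelin by this script.
echo "spark.yarn.dist.archives=${ARCHIVE_URI}"
echo "PYSPARK_PYTHON=${PYTHON_PATH}"
echo "livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON=${PYTHON_PATH}"
```

The sketch deliberately does not guess a value for PYSPARK_DRIVER_PYTHON, because that variable must resolve to a Python executable on the host where the Spark driver runs, which depends on the deploy mode.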