Manually Installing Custom Packages for PySpark

Use the Python package manager, pip (or pip3 for PySpark3), to manually install custom packages on each node in your MapR Data Platform cluster. You need administrative access on your cluster nodes to install the packages.

Procedure

  1. Install the package manager by running the commands for your operating system. In each set, the first two commands install pip for Python 2 and the last two install pip for Python 3 (pip3):
    1. RedHat:
      sudo yum install -y python-devel python-setuptools
      sudo easy_install pip
      sudo yum install -y python34-devel python34-setuptools
      sudo easy_install-3.4 pip
    2. SLES:
      sudo zypper install -y python-devel python-setuptools
      sudo easy_install pip
      sudo zypper install -y python3-devel python3-setuptools
      sudo easy_install-3.4 pip
    3. Ubuntu:
      sudo apt-get install -y python-dev python-setuptools
      sudo easy_install pip
      sudo apt-get install -y python3-dev python3-setuptools
      sudo easy_install3 pip
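
    After installing, you can confirm which installer commands are available on a node. The following is a minimal sketch; the candidate command names are taken from the commands above and may differ on your distribution:

```python
# Sketch: report which Python package-manager commands are on PATH.
# The candidate names below mirror the install commands above and are
# assumptions; adjust them for your distribution.
import shutil

def available_installers(candidates=("pip", "pip3", "easy_install", "easy_install-3.4")):
    """Return the subset of installer commands found on this node's PATH."""
    return [cmd for cmd in candidates if shutil.which(cmd) is not None]

print(available_installers())
```

    Running the snippet on a node that is missing pip or pip3 makes the gap visible before you attempt the package install in the next step.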
  2. Install the custom package using the package manager (pip or pip3) that you installed in step 1.

    The following example installs the matplotlib package:

    sudo pip install matplotlib
    sudo pip3 install matplotlib

    You must install the package on each node in your MapR cluster where PySpark jobs will run. These are the nodes that contain a YARN NodeManager.
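
    Before running PySpark jobs, you can check on each node that the package is importable by the interpreter that will run them. The following sketch uses matplotlib to match the example above; substitute your own package name:

```python
# Sketch: verify that a required package is importable by this interpreter.
# Run it with the same Python (2 or 3) that your PySpark jobs use.
# "matplotlib" matches the example above and is an assumption; replace it
# with the package you installed.
import importlib.util
import sys

def package_installed(name):
    """Return True if the named package can be imported by this interpreter."""
    return importlib.util.find_spec(name) is not None

print(sys.version)
print(package_installed("matplotlib"))
```

    A False result on any NodeManager node means the pip install from step 2 still needs to run there.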

  3. To verify successful installs, run the following code snippet in your Zeppelin UI. Use the %livy.pyspark interpreter for Python 2 and %livy.pyspark3 for Python 3:
    %livy.pyspark
    
    import sys
    print(sys.version)
    
    import matplotlib
    print(matplotlib.__version__)

    The code snippet returns output similar to the following, shown here for the Python 2 and Python 3 interpreters respectively:

    2.7.5 (default, Nov 6 2016, 00:28:07) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
    2.1.0
    3.4.5 (default, May 29 2017, 15:17:55) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
    2.1.0

    The minor versions of Python and matplotlib may differ depending on the versions you install.