Integrate Spark with R

You integrate Spark with R when you want to run R programs as Spark jobs.

Procedure

  1. On each node that will submit Spark jobs, install R 3.2.2 or later:
    • On Ubuntu:
      apt-get install r-base-dev
    • On CentOS/Red Hat:
      yum install R

    For more information about installing R, see the R documentation.
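
    To confirm that a suitable R version is installed before you continue, you can check the version string from an R prompt. This is an optional sanity check, not part of the required procedure:

      # optional check: run inside an R session started with the 'R' command
      R.version.string   # should report version 3.2.2 or later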

  2. To verify the integration, run the following commands as the mapr user or as a user that mapr impersonates:
    1. Start SparkR:
      • On Spark 2.0.1, 2.1.0, and later:
        /opt/mapr/spark/spark-<version>/bin/sparkR --master <master-url> [--deploy-mode <deploy-mode>]
      • On Spark 1.6.1:
        /opt/mapr/spark/spark-<version>/bin/sparkR --master <master-url>
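
      Alternatively, on Spark 2.x you can attach SparkR to an existing R session instead of using the sparkR shell. The following is a minimal sketch, assuming SPARK_HOME is set to your Spark installation directory (for example, /opt/mapr/spark/spark-<version>) and that "yarn" is your master URL; both values are illustrative:

        # assumes SPARK_HOME points at the Spark installation directory
        library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
        # start a SparkSession; the master value ("yarn") is an example only
        sparkR.session(master = "yarn")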
    2. Run the following command to create a DataFrame using sample data:
      On Spark 2.0.1, 2.1.0, and later (the SparkSession is implicit, so no context argument is passed):
      people <- read.df("file:///opt/mapr/spark/spark-<version>/examples/src/main/resources/people.json", "json")
      On Spark 1.6.1:
      people <- read.df(sqlContext, "file:///opt/mapr/spark/spark-<version>/examples/src/main/resources/people.json", "json")
    3. Run the following command to display the first rows of the DataFrame that you just created:
      head(people)
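
    As a further optional check, you can inspect the schema and query the sample data. The calls below are standard SparkR operations and assume the Spark 2.x session and the people DataFrame created in the previous substeps:

      # optional follow-up checks against the 'people' DataFrame created above
      printSchema(people)                       # shows the schema inferred from the JSON file
      count(people)                             # row count of the sample file
      adults <- filter(people, people$age > 20)
      head(select(adults, adults$name))         # names of people older than 20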