Apache Shuffle on YARN

You can disable Direct Shuffle and enable Apache Shuffle by modifying the configuration options in the yarn-site.xml and mapred-site.xml files. This page describes how to configure Apache Shuffle for MapReduce applications.

The shuffling phase in Hadoop is the process of transferring mappers intermediate output to the reducers. Direct shuffle increases the load on file system disks. You can enable the Apache Shuffle to reduce the load on file system disks.

Configuration for Apache Shuffle

Add the following property to yarn-site.xml file:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property> 

Add the following properties to mapred-site.xml file:

 <property>
    <name>mapreduce.job.shuffle.provider.services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>org.apache.hadoop.mapreduce.task.reduce.Shuffle</value>
</property>
<property>
    <name>mapreduce.job.map.output.collector.class</name>
    <value>org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
</property>
<property>
    <name>mapred.ifile.outputstream</name>
    <value>org.apache.hadoop.mapred.IFileOutputStream</value>
</property>
<property>
    <name>mapred.ifile.inputstream</name>
    <value>org.apache.hadoop.mapred.IFileInputStream</value>
</property>
<property>
    <name>mapred.local.mapoutput</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.task.local.output.class</name>
    <value>org.apache.hadoop.mapred.YarnOutputFiles</value>
</property>