mapred-site.xml

Lists the parameters for MapReduce configuration.

MapReduce is a type of application that can run on the Hadoop 2.x framework. MapReduce configuration options are stored in the /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/mapred-site.xml file and are editable by the root user. This file contains configuration information that overrides the default values for MapReduce parameters. Overrides of the default values for core configuration properties are stored in the MapR Data Platform Parameters file.

To override a default value for a property, specify the new value within the <configuration> tags, using the following format:

<property>
 <name> </name>
 <value> </value>
 <description> </description>
</property>
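
For illustration, a filled-in entry might look like the following sketch. It sets mapreduce.framework.name, one of the parameters described on this page, to yarn. The comment and the description text are illustrative wording, not copied from a shipped file.

```xml
<configuration>
  <!-- Illustrative override: run MapReduce applications on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>Execution framework set to Hadoop YARN.</description>
  </property>
</configuration>
```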

Configurations for MapReduce Applications

The configuration comprises the following parameters:

mapreduce.framework.name
Value: yarn
Description: Execution framework set to Hadoop YARN.
mapreduce.input.fileinputformat.split.maxblocknum
Value: 0
Description: Number of blocks that can be added to one split. A value of 0 means that a single split is generated per node.

This functionality requires a patch. To install patches, see Applying a Patch.

mapreduce.map.memory.mb
Value: 1024
Description: Memory limit, in MB, for each map task.
mapreduce.map.java.opts
Value: -Xmx1024M
Description: Larger heap size for child JVMs of maps.
mapreduce.reduce.memory.mb
Value: 3072
Description: Memory limit, in MB, for each reduce task.
mapreduce.reduce.java.opts
Value: -Xmx2560m
Description: Larger heap size for child JVMs of reduces.
mapreduce.task.io.sort.mb
Value: 512
Description: Memory limit, in MB, for the buffer used while sorting data. A higher value makes sorting more efficient.
mapreduce.task.io.sort.factor
Value: 100
Description: More streams merged at once while sorting files.
mapreduce.reduce.shuffle.parallelcopies
Value: 50
Description: Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.
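
As a sketch of how the memory parameters relate, the fragment below pairs the reduce memory limit with its JVM heap setting, using the values listed above. The heap (-Xmx2560m) is kept below the memory limit (3072 MB) so that non-heap JVM memory still fits. Treat the numbers as a starting point under that assumption, not as a recommendation for every cluster.

```xml
<configuration>
  <!-- Memory limit for each reduce task, in MB -->
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>
  </property>
  <!-- JVM heap for the reduce child process; kept below the 3072 MB limit -->
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2560m</value>
  </property>
</configuration>
```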

Configurations for MapReduce JobHistory Server

The configuration comprises the following parameters:

mapr.localspill.expiration.date
Value: days
Description: Number of days after which spill files expire. The default value is 30 days.
mapreduce.jobhistory.address
Value: MapReduce JobHistory Server host:port
Description: Default port is 10020.
mapreduce.jobhistory.webapp.address
Value: MapReduce JobHistory Server Web UI host:port
Description: Default port is 19888.
mapreduce.jobhistory.intermediate-done-dir
Value: /mr-history/tmp
Description: Directory where history files are written by MapReduce applications.
mapreduce.jobhistory.intermediate-done-scan-timeout
Value: milliseconds
Description: Timeout, in milliseconds, for rescanning the done_intermediate user directory. A longer timeout reduces the load on the JobHistory Server, but information about a job becomes available only after a delay equal to the timeout. Adjust the setting based on the cluster load: start with 5000 ms and increase the timeout as needed.
NOTE This functionality requires a patch. To install patches, see Applying a Patch.
mapreduce.jobhistory.done-dir
Value: /mr-history/done
Description: Directory where history files are managed by the MapReduce JobHistory Server.
mapreduce.jobhistory.webapp.https.address
Value: Secure MapReduce JobHistory Server Web UI host:port (HTTPS)
Description: Default port is 19890.
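
The directory and timeout settings above can be set in the same way. The sketch below uses the directory values listed above and the suggested starting timeout of 5000 ms. It assumes the patch required for the scan-timeout property is installed.

```xml
<configuration>
  <!-- Where MapReduce applications write history files -->
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/mr-history/tmp</value>
  </property>
  <!-- Where the JobHistory Server manages history files -->
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/mr-history/done</value>
  </property>
  <!-- Rescan interval in milliseconds; suggested starting value, raise under load -->
  <property>
    <name>mapreduce.jobhistory.intermediate-done-scan-timeout</name>
    <value>5000</value>
  </property>
</configuration>
```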

Sample Hadoop 2.x mapred-site.xml File

The following mapred-site.xml file defines values for two job history parameters.

<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>__HS_IP__:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>__HS_IP__:19888</value>
  </property>
</configuration>

Configuration for Apache Shuffle

You can disable Direct Shuffle and enable Apache Shuffle for MapReduce applications through the following settings:
mapreduce.job.shuffle.provider.services
Value: mapreduce_shuffle
mapreduce.job.reduce.shuffle.consumer.plugin.class
Value: org.apache.hadoop.mapreduce.task.reduce.Shuffle
mapreduce.job.map.output.collector.class
Value: org.apache.hadoop.mapred.MapTask$MapOutputBuffer
mapred.ifile.outputstream
Value: org.apache.hadoop.mapred.IFileOutputStream
mapred.ifile.inputstream
Value: org.apache.hadoop.mapred.IFileInputStream
mapred.local.mapoutput
Value: true
mapreduce.task.local.output.class
Value: org.apache.hadoop.mapred.YarnOutputFiles
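
Written as mapred-site.xml entries, the settings above might look like the following sketch. The property names and values are the ones listed in this section. The comments and the grouping into a single <configuration> block are illustrative assumptions.

```xml
<configuration>
  <!-- Use the Apache shuffle service instead of Direct Shuffle -->
  <property>
    <name>mapreduce.job.shuffle.provider.services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>org.apache.hadoop.mapreduce.task.reduce.Shuffle</value>
  </property>
  <property>
    <name>mapreduce.job.map.output.collector.class</name>
    <value>org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
  </property>
  <!-- Apache IFile stream classes -->
  <property>
    <name>mapred.ifile.outputstream</name>
    <value>org.apache.hadoop.mapred.IFileOutputStream</value>
  </property>
  <property>
    <name>mapred.ifile.inputstream</name>
    <value>org.apache.hadoop.mapred.IFileInputStream</value>
  </property>
  <!-- Local map output settings -->
  <property>
    <name>mapred.local.mapoutput</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.task.local.output.class</name>
    <value>org.apache.hadoop.mapred.YarnOutputFiles</value>
  </property>
</configuration>
```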