Configure Chunk Size
Chunk size affects parallel processing and random disk I/O
during MapReduce jobs. A higher chunk size means less parallel
processing because there are fewer map inputs, and therefore fewer
mappers. A lower chunk size improves parallelism, but results in
higher random disk I/O during shuffle because there are more map
outputs. Set the io.sort.mb
parameter to a value
between 120% and 150% of the chunk size.
Here are general guidelines for chunk size:
- For most purposes, set the chunk size to the default 256 MB and
set the value of the
io.sort.mb
parameter to the default 380 MB.
- On very small clusters or nodes with not much RAM, set the
chunk size to 128 mb and set the value of the
io.sort.mb
parameter to 190 MB.
- If application-level compression is in use, the
io.sort.mb
parameter should be at least 380MB.