Configure Chunk Size

Chunk size affects parallel processing and random disk I/O during MapReduce jobs. A higher chunk size means less parallel processing because there are fewer map inputs, and therefore fewer mappers. A lower chunk size improves parallelism, but results in higher random disk I/O during shuffle because there are more map outputs. Set the io.sort.mb parameter to a value between 120% and 150% of the chunk size.

Here are general guidelines for chunk size:

  • For most purposes, set the chunk size to the default 256 MB and set the value of the io.sort.mb parameter to the default 380 MB.
  • On very small clusters or nodes with not much RAM, set the chunk size to 128 mb and set the value of the io.sort.mb parameter to 190 MB.
  • If application-level compression is in use, the io.sort.mb parameter should be at least 380MB.