MapR-DB JSON MapReduce API Library
This API library extends the Apache Hadoop MapReduce framework so that you can write your own MapReduce applications that write data from one JSON table to another.
Prerequisites to Using this API Library
- Ensure that you have a firm grasp of MapReduce concepts and experience writing MapReduce applications.
- Before running a MapReduce application that uses this API, ensure that the destination JSON table or tables already exist and that any column families other than the default are already created on the destination tables.
Classes
The following table summarizes the information that is in the javadoc, which you can refer to for complete details of the classes.
| Category | Class | Description |
|---|---|---|
| Utility | `MapRDBMapReduceUtil` | Simplifies the use of the API for most use cases. |
| Input formatter | `TableInputFormat` | Describes how to read documents from MapR-DB JSON tables. |
| Record reader | `TableRecordReader` | Reads documents (records) from MapR-DB JSON tables. |
| Record writer - bulk load | `BulkLoadRecordWriter` | Bulk loads documents into MapR-DB JSON tables. |
| Record writer - table mutation | `TableMutationRecordWriter` | Modifies documents that are in MapR-DB JSON tables. |
| Record writer - table | `TableRecordWriter` | Writes documents to MapR-DB JSON tables. |
| Output formatter - bulk load | `BulkLoadOutputFormat` | Describes how to bulk load documents into MapR-DB JSON tables. |
| Output formatter - table | `TableOutputFormat` | Describes how to write documents to MapR-DB JSON tables. |
| Serializer - document | `DocumentSerialization` | Defines the serializer and deserializer for passing `Document` objects between the map and reduce phases. |
| Serializer - mutation | `MutationSerialization` | Defines the serializer and deserializer for passing `DocumentMutation` objects between the map and reduce phases. |
| Partitioner - table | `TablePartitioner` | Specifies how to partition data from the source JSON table. |
| Partitioner - total order | `TotalOrderPartitioner<K,V>` | Globally sorts data according to row key and then partitions the sorted data. This class is useful when the destination table has been pre-split into two or more tablets. |
Using MapRDBMapReduceUtil to Set Default Values in Configurations and Jobs
The centerpiece of this API is the `MapRDBMapReduceUtil` class, which you can use in the `createSubmittableJob()` method of your applications to perform these actions:
- Set default values in the configuration for a MapReduce job and set the input and output format classes. You can do so with these methods:
  - `configureTableInputFormat(org.apache.hadoop.mapreduce.Job job, String srcTable)`
    This method performs these actions:
    - Sets the serialization class for `Document` and `Value` objects. These interfaces are part of the OJAI (Open JSON Application Interface) API.
    - Sets the field `INPUT_TABLE` in `TableInputFormat` to the path and name of the source table, and passes this value to the configuration for the MapReduce job.
    - Sets the input format class for the job to `TableInputFormat`.
  - `configureTableOutputFormat(org.apache.hadoop.mapreduce.Job job, String destTable)`
    This method performs these actions:
    - Sets the field `OUTPUT_TABLE` in `TableOutputFormat` to the path and name of the destination table, and passes this value to the configuration for the MapReduce job.
    - Sets the output format class for the job to `TableOutputFormat`.
- If you prefer to extend `TableInputFormat` or `TableOutputFormat`, or to write your own logic for them, you can pass field values to configurations and specify these classes for jobs as you would in common MapReduce applications.
- Set default types for output keys and values. You can also set types for output keys and values from the map phase, if those types will differ from the final output types. Use these methods:
  - `setMapOutputKeyValueClass(org.apache.hadoop.mapreduce.Job job)`
  - `setOutputKeyValueClass(org.apache.hadoop.mapreduce.Job job)`
- Configure a `TotalOrderPartitioner` and return the number of reduce tasks to use for a job. For example, suppose that in your application's method for creating a job, you include this line:
  `int numReduceTasks = MapRDBMapReduceUtil.setPartitioner(job, destPath);`
  Here, `job` is the `org.apache.hadoop.mapreduce.Job` object and `destPath` is the path of the destination table. The `setPartitioner()` method finds out whether a table has been pre-split into two or more tablets, counts the number of tablets, writes the number to a partitioner file, and sends that file to an instance of `TotalOrderPartitioner`. The method also returns the number of tablets to `numReduceTasks`. Your code can then use that variable to set the number of reducers, like this:
  `job.setNumReduceTasks(numReduceTasks);`
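Putting these calls together, a `createSubmittableJob()` method might look like the following sketch. The driver class, the table paths, and the commented-out mapper and reducer classes are hypothetical; the `MapRDBMapReduceUtil` calls are used as described above, and the import path shown is the usual package for this class, so check the javadoc for your version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
// MapRDBMapReduceUtil is part of the MapR-DB JSON MapReduce API library.
import com.mapr.db.mapreduce.MapRDBMapReduceUtil;

public class CopyTableDriver {

  public static Job createSubmittableJob(Configuration conf,
                                         String srcTable,
                                         String destTable) throws Exception {
    Job job = Job.getInstance(conf, "copy-json-table");
    job.setJarByClass(CopyTableDriver.class);
    // job.setMapperClass(MyDocumentMapper.class);   // your mapper here
    // job.setReducerClass(MyDocumentReducer.class); // your reducer here

    // Set serialization classes, INPUT_TABLE, and the input format class.
    MapRDBMapReduceUtil.configureTableInputFormat(job, srcTable);

    // Set OUTPUT_TABLE and the output format class
    // (this also turns off speculative execution).
    MapRDBMapReduceUtil.configureTableOutputFormat(job, destTable);

    // Set default key/value types for the map output and the final output.
    MapRDBMapReduceUtil.setMapOutputKeyValueClass(job);
    MapRDBMapReduceUtil.setOutputKeyValueClass(job);

    // Configure a TotalOrderPartitioner and size the reduce phase to match
    // the number of tablets in the pre-split destination table.
    int numReduceTasks = MapRDBMapReduceUtil.setPartitioner(job, destTable);
    job.setNumReduceTasks(numReduceTasks);

    return job;
  }
}
```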
For complete details about these methods, see the javadoc for `MapRDBMapReduceUtil`.
Mutating Rows in Destination Tables
Use the `MutationSerialization` and `TableMutationRecordWriter` classes when you need to mutate rows.
For example, suppose that you are tracking the number of users who are performing various actions on your retail website. To do this, at intervals you run your MapReduce application and save the results in OJAI documents in MapR-DB. Suppose that you count the number of users who went through the order process but abandoned their orders. After every run of the application, you want to update an OJAI document by adding the current count to the total count and by updating a field that tracks the date and time that the MapReduce application was last run.
You could do that by setting values in a `DocumentMutation` object (see the javadoc for OJAI (Open JSON Application Interface)). You would then serialize the mutation and write it to the table with `TableMutationRecordWriter`.
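As a sketch of that pattern, a reduce phase might build the mutation as follows. The field names (`abandonedCount`, `lastRun`), the row-key and value types in the reducer signature, and the class name are all hypothetical; `MapRDB.newMutation()`, `increment()`, and `setOrReplace()` are from the OJAI/MapR-DB Java API, so check the javadoc for the exact types your version expects.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.ojai.store.DocumentMutation;

import com.mapr.db.MapRDB;

// Hypothetical reducer: sums this run's abandoned-order counts and emits a
// DocumentMutation that updates the running-totals document in place.
public class AbandonedOrderReducer
    extends Reducer<Text, LongWritable, Text, DocumentMutation> {

  @Override
  protected void reduce(Text rowKey, Iterable<LongWritable> counts,
                        Context context)
      throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable c : counts) {
      total += c.get();
    }

    // Build a mutation instead of rewriting the whole document:
    // add this run's count to the stored total, and record when the job ran.
    DocumentMutation mutation = MapRDB.newMutation()
        .increment("abandonedCount", total)                   // hypothetical field
        .setOrReplace("lastRun", System.currentTimeMillis()); // hypothetical field

    // TableMutationRecordWriter applies this mutation to the row
    // identified by rowKey.
    context.write(rowKey, mutation);
  }
}
```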
Compiling and Running Applications
Compile applications as described in Compiling and Running Applications that Access JSON Tables and Documents.
Speculative execution of MapReduce tasks is on by default. For custom applications that load MapR-DB tables, it is recommended to turn speculative execution off. When it is on, the tasks that import data might run multiple times. Multiple tasks for an incremental bulkload could insert one or more versions of a record into a table. Multiple tasks for a full bulkload could cause loss of data if the source data continues to be updated during the load.
If your custom MapReduce application uses `MapRDBMapReduceUtil.configureTableOutputFormat()`, you do not have to turn off speculative execution manually; this method turns it off automatically. Otherwise, turn off speculative execution in either of the following ways:
- Set one of the following MapReduce parameters to `false`, depending on the version of MapReduce that you are using:
  - MRv1: `mapred.map.tasks.speculative.execution`
  - MRv2: `mapreduce.map.speculative`
- Include the following line in the method in your application that sets parameters for jobs:
  `job.setSpeculativeExecution(false);`
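The parameter-based approach can also be applied from driver code rather than from a configuration file. A minimal sketch, assuming MRv2 (for MRv1, the property name is `mapred.map.tasks.speculative.execution` as listed above):

```java
import org.apache.hadoop.conf.Configuration;

public class DisableSpeculation {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Turn off speculative execution of map tasks (MRv2 property name).
    conf.setBoolean("mapreduce.map.speculative", false);
    // Pass conf to Job.getInstance(conf, ...) when creating the job.
    System.out.println(conf.getBoolean("mapreduce.map.speculative", true)); // prints false
  }
}
```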