Integrate Pig and HBase

This document walks through an example of Pig and HBase integration. The goal is to load data from the file system into Pig and then store the data in an HBase table.

IMPORTANT: This component is deprecated. Hewlett Packard Enterprise recommends using an alternate product. Deprecated components are either in maintenance or have reached the end of their maintenance lifecycle. For more information, see Discontinued Ecosystem Components.

Configuring Pig and HBase

No additional configuration is needed to integrate HBase and Pig.

Pig and HBase Integration Example

  1. Create sample data, and upload the data to the file system:
    1. Create a sample data file:
      vim input.csv
    2. Add data to the file:
      1,aaa,bbb
      2,ccc,ddd
      3,rrr,fff
      4,ttt,yyy
    3. Upload the data to the file system:
      hadoop fs -put input.csv /user/mapr/input.csv
  2. Create a sample table in HBase:
    1. Start the HBase shell:
      hbase shell
    2. Create a table:
      hbase(main):012:0> create 'sample_names', 'info'
  3. Load the data into Pig, and store the data in HBase:
    1. Start the Pig shell:
      pig
    2. Load the data into Pig:
      raw_data = LOAD '/user/mapr/input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);
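    3. Store the data in HBase. HBaseStorage uses the first field (listing_id) as the row key and writes the remaining fields, in order, to the listed columns: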
      STORE raw_data INTO 'sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('info:fname info:lname');
  4. Verify the data in HBase:
    1. Start the HBase shell:
      hbase shell
    2. Query the data:
      hbase(main):017:0> scan 'sample_names'
    The result is:
    ROW	COLUMN+CELL
     1	column=info:fname, timestamp=1574946889082, value=aaa
     1	column=info:lname, timestamp=1574946889082, value=bbb
     2	column=info:fname, timestamp=1574946889091, value=ccc
     2	column=info:lname, timestamp=1574946889091, value=ddd
     3	column=info:fname, timestamp=1574946889091, value=rrr
     3	column=info:lname, timestamp=1574946889091, value=fff
     4	column=info:fname, timestamp=1574946889091, value=ttt
     4	column=info:lname, timestamp=1574946889091, value=yyy
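
The steps above can also be run without the interactive shells. The following is a minimal sketch, not part of the original procedure. It assumes the hadoop, hbase, and pig clients are on the PATH and that piping a command into hbase shell is acceptable in your environment. The file name load_to_hbase.pig and the command -v guard are illustrative additions.

```shell
#!/bin/sh
# Sketch: run the Pig and HBase example non-interactively.
set -eu

# Step 1: create the sample data file (same rows as in the example).
cat > input.csv <<'EOF'
1,aaa,bbb
2,ccc,ddd
3,rrr,fff
4,ttt,yyy
EOF

# Step 3 as a script file. load_to_hbase.pig is a hypothetical name.
# The first field (listing_id) becomes the HBase row key.
cat > load_to_hbase.pig <<'EOF'
raw_data = LOAD '/user/mapr/input.csv' USING PigStorage(',') AS (listing_id: chararray, fname: chararray, lname: chararray);
STORE raw_data INTO 'sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:fname info:lname');
EOF

# Only touch the cluster when all three client tools are installed.
if command -v hadoop >/dev/null 2>&1 \
    && command -v hbase >/dev/null 2>&1 \
    && command -v pig >/dev/null 2>&1; then
    # Upload the data to the file system.
    hadoop fs -put input.csv /user/mapr/input.csv
    # Step 2: create the table by piping the command into the HBase shell.
    echo "create 'sample_names', 'info'" | hbase shell
    # Step 3: run the Pig script.
    pig -f load_to_hbase.pig
    # Step 4: verify the data.
    echo "scan 'sample_names'" | hbase shell
else
    echo "hadoop, hbase, or pig not found; generated local files only" >&2
fi
```

Keeping the Pig statements in a script file makes the load repeatable. Note that hadoop fs -put fails if /user/mapr/input.csv already exists, so remove the file first when rerunning.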