Getting Started with Hive-HBase Integration

In this tutorial you will:

  • Create a Hive table
  • Populate the Hive table with data from a text file
  • Query the Hive table
  • Create a Hive-HBase table
  • Introspect the Hive-HBase table from HBase
  • Populate the Hive-Hbase table with data from the Hive table
  • Query the Hive-HBase table from Hive
  • Convert an existing HBase table into a Hive-HBase table

Be sure that you have successfully completed all the steps in the Install and Configure Hive and HBase section before beginning this Getting Started tutorial. This Getting Started tutorial closely parallels the Hive-HBase Integration section of the Apache Hive Wiki, and thanks to Samuel Guo and other contributors to that effort.

Create a Hive table with two columns:

Change to your Hive installation directory if you're not already there and start Hive:

$ cd $HIVE_HOME
$ bin/hive

Execute the CREATE TABLE command to create the Hive pokes table:

hive> CREATE TABLE pokes (foo INT, bar STRING);

To see if the pokes table has been created successfully, execute the SHOW TABLES command:

hive> SHOW TABLES;
OK
pokes
Time taken: 0.74 seconds

The pokes table appears in the list of tables.

Populate the Hive pokes table with data

Execute the LOAD DATA LOCAL INPATH command to populate the Hive pokes table with data from the kv1.txt file.

The kv1.txt file is provided in the $HIVE_HOME/examples directory.

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

A message appears confirming that the table was created successfully, and the Hive prompt reappears:

Copying data from file:
...
OK
Time taken: 0.278 seconds
hive>

Execute a SELECT query on the Hive pokes table:

hive> SELECT * FROM pokes WHERE foo = 98;

The SELECT statement executes, runs a MapReduce application, and prints the job output:

OK
98      val_98
98      val_98
Time taken: 18.059 seconds

The output of the SELECT command displays two identical rows because there are two identical rows in the Hive pokes table with a key of 98. Note: This is a good illustration of the concept that Hive tables can have multiple identical keys. As we will see shortly, HBase tables cannot have multiple identical keys, only unique keys.

To create a Hive-HBase table, enter these four lines of code at the Hive prompt:

hive> CREATE TABLE hbase_table_1(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    > TBLPROPERTIES ("hbase.table.name" = "xyz");

After a brief delay, a message appears confirming that the table was created successfully:

OK
Time taken: 5.195 seconds

Note: The TBLPROPERTIES command is not required, but those new to Hive-HBase integration may find it easier to understand what's going on if Hive and HBase use different names for the same table.

In this example, Hive will recognize this table as "hbase_table_1" and HBase will recognize this table as "xyz".

Start the HBase shell:

Keeping the Hive terminal session open, start a new terminal session for HBase, then start the HBase shell:

$ cd $HBASE_HOME
$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, rUnknown, Wed Nov  9 17:35:00 PST 2011

hbase(main):001:0>

Execute the list command to see a list of HBase tables:

hbase(main):001:0> list
TABLE
xyz
1 row(s) in 0.8260 seconds

HBase recognizes the Hive-HBase table named xyz. This is the same table known to Hive as hbase_table_1.

Display the description of the xyz table in the HBase shell:

hbase(main):004:0> describe "xyz"
DESCRIPTION                                                                       ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE', REPLICATI true
 ON_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BL
 OCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0190 seconds

From the Hive prompt, insert data from the Hive table pokes into the Hive-HBase table hbase_table_1

hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;
...
2 Rows loaded to hbase_table_1
OK
Time taken: 13.384 seconds

Query hbase_table_1 to see the data we have inserted into the Hive-HBase table:

hive> SELECT * FROM hbase_table_1;
OK
98      val_98
Time taken: 0.56 seconds

Even though we loaded two rows from the Hive pokes table that had the same key of 98, only one row was actually inserted into hbase_table_1. This is because hbase_table_1 is an HBASE table, and although Hive tables support duplicate keys, HBase tables only support unique keys. HBase tables arbitrarily retain only one key, and will silently discard all the data associated with duplicate keys.

Convert a pre-existing HBase table to a Hive-HBase table

To convert a pre-existing HBase table to a Hive-HBase table, enter the following four commands at the Hive prompt.

Note that in this example the existing HBase table is my_hbase_table.

hive> CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
    > TBLPROPERTIES("hbase.table.name" = "my_hbase_table");

Now we can run a Hive query against the pre-existing HBase table my_hbase_table that Hive sees as hbase_table_2:

hive> SELECT * FROM hbase_table_2 WHERE key > 400 AND key < 410;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
OK
401     val_401
402     val_402
403     val_403
404     val_404
406     val_406
407     val_407
409     val_409
Time taken: 9.452 seconds