Hive Migration

You can continue to use Hive tables in a Data Fabric cluster.

About this task

Hive facilitates the analysis of large datasets stored in the Hadoop file system by organizing that data into tables that can be queried and analyzed using a dialect of SQL called HiveQL. The schemas that define these tables and all other Hive metadata are stored in a centralized repository called the metastore.

If you would like to continue using Hive tables developed on an HDFS cluster in a Data Fabric cluster, you can import Hive metadata from the metastore to recreate those tables in Data Fabric. Depending on your needs, you can choose to import a subset of table schemas or the entire metastore in one go:
  • Importing table schemas into a Data Fabric cluster: Use this procedure to import a subset of the Hive metastore from an HDFS cluster to a Data Fabric cluster. This method is preferred when you want to test a subset of applications using a smaller subset of data.
  • Importing an entire Hive metastore into a Data Fabric cluster: Use the following procedure to import an entire Hive metastore from an HDFS cluster to a Data Fabric cluster. This method is preferred when you want to test all applications using a complete dataset. You will need to redirect all links that formerly pointed to HDFS (hdfs://<namenode>:<port number>/<path>) so that they point to the Data Fabric file system (maprfs:///<path>), as illustrated in the example below.

    MySQL is a very popular choice for the Hive metastore and is used in the following example. If you are using another RDBMS, consult the relevant documentation.
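
    For example, a table location recorded in the metastore changes from an hdfs:// URI to a maprfs:/// URI. The namenode host, port, and warehouse path shown here are illustrative:

      Before:  hdfs://namenode01:8020/user/hive/warehouse/sales
      After:   maprfs:///user/hive/warehouse/sales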

Procedure

  1. Ensure that both Hive and your database are installed on one of the nodes in the Data Fabric cluster. For step-by-step instructions on setting up a standalone MySQL metastore, see Using MySQL for the Hive Metastore.
  2. On the HDFS cluster, back up the metastore to a file.
    mysqldump [options] --databases db_name... > filename
  3. Ensure that paths in the dumpfile point to the Data Fabric file system rather than HDFS. Search the dumpfile and edit all of the URIs that point to hdfs:// so that they point to maprfs:/// instead (see the worked example after this procedure).
  4. Import the data from the dumpfile into the metastore running on the node in the Data Fabric cluster:
    mysql [options] db_name < filename
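
For example, if the metastore database is named hive (the database name, user, and file name below are illustrative), steps 2 through 4 might look like the following. Review the edited dumpfile before importing it:

    # On the HDFS cluster: dump the metastore database to a file
    mysqldump -u root -p --databases hive > hive_metastore.sql

    # Rewrite HDFS URIs so that they point to the Data Fabric file system
    sed -i 's|hdfs://[^/]*|maprfs://|g' hive_metastore.sql

    # On the node in the Data Fabric cluster: load the edited dump into the metastore
    # (because the dump was created with --databases, it already contains the
    # CREATE DATABASE and USE statements, so no database name is needed here)
    mysql -u root -p < hive_metastore.sql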

What to do next

Using Hive with Data Fabric volumes

The Data Fabric file system does not allow moving or renaming across volume boundaries. Before running a Hive job, be sure to set the Hive Scratch Directory and Hive Warehouse Directory in the same volume where the data for that job resides. For more information, see How Hive Handles Scratch Directories on Data Fabric in Hive Directories.
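
For example, if the table data resides in a volume mounted at /myvolume (this path and the directory names under it are illustrative), you can point both directories at that volume when you start Hive. The properties involved are hive.exec.scratchdir and hive.metastore.warehouse.dir; setting them in hive-site.xml is the more common approach, and whether a per-session override of the warehouse directory takes effect depends on how your metastore is configured:

    hive --hiveconf hive.exec.scratchdir=/myvolume/tmp/hive \
         --hiveconf hive.metastore.warehouse.dir=/myvolume/hive/warehouse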