Storage Plugins Overview

The Hadoop cluster within the sandbox is set up with MapR File System, MapR Database, and Hive, which all serve as data sources for Drill in this tutorial. Before you can run queries against these data sources, you need to connect to the data source through an interface called a storage plugin. A storage plugin defines interfaces to read/write and get metadata from the data source. Each storage plugin also exposes optimization rules for Drill to leverage for efficient query execution.

Jump directly to the queries in Lesson 1: Learn About the Data Set, or first, get some important background information about pre-configured storage plugins by following these steps:

  1. Start the Drill Web Console.
  2. Go to the Storage tab.
  3. Open the configured storage plugins one at a time by clicking Update. You will see the following plugins configured.

dfs

This is a storage plugin configuration for the MapR File System in the sandbox. The connection attribute indicates the type of distributed filesystem: in this case, MapR File System. Drill can work with any distributed system, including HDFS, S3, and so on.

The configuration also includes a set of workspaces; each one represents a location in MapR File System:

  • root: access to the root filesystem location

  • clicks: access to nested JSON log data

  • logs: access to flat (non-nested) JSON log data in the logs directory and its subdirectories

  • views: a workspace for creating views

A workspace in Drill is a location where users can easily access a specific set of data and collaborate with each other by sharing artifacts. Users can create as many workspaces as they need within Drill.

Each workspace can also be configured as “writable” or not, which indicates whether users can write data to this location and defines the storage format in which the data will be written (parquet, csv, json). These attributes become relevant when you explore Drill SQL commands, especially CREATE TABLE AS (CTAS) and CREATE VIEW.

Drill can query files and directories directly and can detect the file formats based on the file extension or the first few bits of data within the file. However, additional information around formats is required for Drill, such as delimiters for text files, which are specified in the “formats” section as follows.

{
 "type": "file",
 "enabled": true,
 "connection": "maprfs:///",
 "workspaces": {
   "root": {
     "location": "/mapr/demo.mapr.com/data",
     "writable": false,
     "storageformat": null
   },
   "clicks": {
     "location": "/mapr/demo.mapr.com/data/nested",
     "writable": true,
     "storageformat": "parquet"
   },
   "logs": {
     "location": "/mapr/demo.mapr.com/data/flat",
     "writable": true,
     "storageformat": "parquet"
   },
   "views": {
     "location": "/mapr/demo.mapr.com/data/views",
     "writable": true,
     "storageformat": "parquet"
 },
 "formats": {
   "psv": {
     "type": "text",
     "extensions": [
       "tbl"
     ],
     "delimiter": "|"
   },
   "csv": {
     "type": "text",
     "extensions": [
       "csv"
     ],
     "delimiter": ","
   },
   "tsv": {
     "type": "text",
     "extensions": [
       "tsv"
     ],
     "delimiter": "\t"
   },
   "parquet": {
     "type": "parquet"
   },
   "json": {
     "type": "json"
   }
 }
}

hive

A storage plugin configuration for a Hive data warehouse within the sandbox. Drill connects to the Hive metastore by using the configured metastore thrift URI. Metadata for Hive tables is automatically available for users to query.

{
 "type": "hive",
 "enabled": true,
 "configProps": {
   "hive.metastore.uris": "thrift://localhost:9083",
   "hive.metastore.sasl.enabled": "false"
 }
}