Drill Limitations

Max Drill Query Size Depends on ZooKeeper jute.maxbuffer Value

Issue: Drill cannot run a query that exceeds the ZooKeeper jute.maxbuffer value of 1 MB.

Solution

For the ZooKeeper jute.maxbuffer property, follow the recommendations in the ZooKeeper documentation.

CAUTION The jute.maxbuffer property is marked as an unsafe option. Do not change it to a higher value to run larger queries. For additional details about why this property is unsafe, see the Apache Zookeeper documentation: https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html

The following error indicates that the query is too large. To resolve this error, rewrite the query to fit within the jute.maxbuffer limit:

Query execution error. Details: EXECUTION_ERROR ERROR: Failed to persist query info. Query length is too big.

In Drill EEP version 8.1.0 and earlier and EEP version 9.1.1 and earlier, the following type of error may also indicate that the query is too large:

Query execution error. Details: SYSTEM ERROR: ConnectionLossException: KeeperErrorCode = ConnectionLoss for /drill/running/1bb44a40-c715-7b38-c310-05de39dfb3e7

In Drill EEP version 8.1.1, this error could indicate that the jute.maxbuffer value set for the Drillbit does not correspond with the ZooKeeper jute.maxbuffer value.

To change the value of jute.maxbuffer in Drill, add the -Djute.maxbuffer Java property to DRILL_JAVA_OPTS in <drill_home>/conf/drill-env.sh, as shown in the following example:

export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Djute.maxbuffer=900000

If you change the jute.maxbuffer value, you must update the system property on all servers (ZooKeeper nodes) and clients (Drillbit nodes) to have the same value. Failure to change the value on all servers and clients can result in further errors.

Working with subqueries

The SELECT list in a scalar subquery can only contain one item/column.
Correlated subqueries should return exactly one row.
The WHERE clause of a subquery should not refer to more than one column of the table in the outer query.

Queries on JSON Files and Tables in HPE Ezmeral Data Fabric Return an OutOfMemoryException

An architectural limitation in Apache Drill can cause an overconsumption of memory when Drill queries files or tables with a certain JSON structure, resulting in an OutOfMemoryException.

Drill may return an OutOfMemoryException for queries that run against JSON files stored in HPE Ezmeral Data Fabric Database and HPE Ezmeral Data Fabric File Store that have many key-value pairs if the queries include the key-value pairs.

For example, the following JSON files could cause Drill to return an OutOfMemoryException when queries that include the key-value pairs run against the files:

JSON file with objects that have many key-value pairs

{
    "context" : {
        "1": "a",
        "2": "b",
        "3": "c",
        ... // many key-value pairs; not showing 9996 of them
        "10000": "d"
    }
}
{
    "context" : {
        "10001": "b"
        ... //many key-value pairs; not showing 19998 of them
        "30000": "z"
    }
}
{
    "context" : {
        "3": "c"
    }
}
{
    "context" : {

    }
}
{
    "context" : {
        "5": "e"
    }
}

JSON file with thousands of objects, each having a unique key

{
    "context" : {
        "1": "a"
    }
}
{
    "context" : {
        "2": "b"
    }
}
{
    "context" : {
        "3": "c"
    }
}
{
    "context" : {
        "4": "d"
    }
}
{
    "context" : {
        "5": "e"
    }
}

In this scenario, if a JSON file contains thousands of objects with unique keys (in 30,000+ documents), the following query would cause an OutOfMemoryException:

SELECT context FROM maprfs.`folderWithJSONDocuments`;

Issue Cause

Drill was designed to run queries against massive amounts of data. To successfully run such queries, Drill has a columnar execution engine that works with vectors. Drill creates a separate vector for each unique key and then allocates memory to each vector. Each vector stores about 1,024 values, which varies slightly depending on the data type of the value.

In the following illustration, each key has a VARCHAR value and Drill creates a NullableVarCharVector for each unique key:

Drill allocates 40960 bytes of direct memory to each NullableVarCharVector. You can see how Drill fills each vector with 7 bytes (2 bytes for a single CHAR string, like "a" and 5 bytes for internally used values).

In cases where Drill is querying thousands upon thousands of JSON files, this works well. However, in cases where Drill queries a single file, a memory issue occurs because each key-value pair in the JSON file may consume more than 1000x more memory than is required for the corresponding value. Each vector unnecessarily holds memory for several values, resulting in failed queries due to a memory shortage.

Refer to Value Vectors for more information about vectors in Drill.

Issue Resolution

To resolve or prevent this issue, change the format of the key-value pairs in the JSON file from an object to an array of objects, as shown in the following example:

{
  "context" : [
    {
      "key": "1",
      "value": "a"
    },
    {
      "key": "2",
      "value": "b"
    },
    {
      "key": "3",
      "value": "c"
    },
    ...,
    {
      "key": "100000",
      "value": "z"
    }
  ]
}

When a JSON file has an array of objects, Drill only creates two NullableVarCharVectors – one for the "key" and one for the "value". With this structure, only two vectors need to hold memory.

In the following illustration, you can see how Drill fills two vectors with many values versus filling thousands of vectors with only a few values:

Unequal JOIN Support

By default, unequal joins are disabled in Drill because Drill does not have optimized JOIN operators to process these types of queries. An unequal join contains a condition with an inequality comparison between columns. An unequal join uses operators, such as <>, <, >, <=, or >= to establish the relationship between the joined columns. For example:

SELECT *
  FROM table1
  JOIN table2 
  ON table1.column_name <= table2.column_name;

By default, if you run a query with an unequal join, the query fails and Drill returns the following error:

Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join.
If a cartesian or inequality join is used intentionally, set the option 'planner.enable_nljoin_for_scalar_only' to false and try again.

Although Drill does not have optimized JOIN operators, Drill can process these types of queries in a non-efficient way using a nested loop join operator. To enable this, set the planner.enable_nljoin_for_scalar_only option to false. Drill can then execute LEFT and INNER joins; however, RIGHT joins will still fail with the error message shown.

Even when planner.enable_nljoin_for_scalar_only is disabled, Drill may still return an UNSUPPORTED_OPERATION ERROR for LEFT joins due to the join optimizations in Drill. If Drill detects that the right input is larger than the left, Drill optimizes the join such that the left and right inputs are flipped and the LEFT join type will be changed to RIGHT. If the query contains non-equi joins, after such optimizations, the query will fail because the nested loop join operator does not allow RIGHT joins.

Using the following query as an example:

SELECT *
  FROM table1
  LEFT JOIN table2 
  ON table1.column_name <= table2.column_name;

If table1 is smaller than table2 or even empty, Drill applies a join optimization on the query and the query fails with the following error:

Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join.
If a cartesian or inequality join is used intentionally, set the option 'planner.enable_nljoin_for_scalar_only' to false and try again.

The same optimization can make a RIGHT join executable when the left input is smaller than the right. In this case, the left and right inputs will be flipped and the RIGHT join type will be changed to LEFT. The nested loop operator supports LEFT joins. This is how Drill executes a query with a RIGHT join and no errors.

To avoid non-equi join errors, disable the join optimization by setting the planner.enable_join_optimization option to false.

IMPORTANT Join optimizations improve query performance. If you disable the planner.enable_join_optimization option, it disables optimizations for all join types. If you need to disable this option for non-equi joins, only disable the planner.enable_join_optimization option at the session level using the ALTER SESSION SET command or through the Drill Web UI.