New Features in Impala 2.5.0 for MapR
Impala 2.5.0 for MapR introduces some new features that enhance the behavior and performance of Impala.
Performance Improvements
- Impala caches file handles to avoid the overhead of repeatedly opening the same file which improves I/O performance.
- Basic query types, such as counting the elements of a complex column, use an optimized code path that improves the performance of queries involving nested complex types.
- Impala uses code generation in certain parts of queries, such as evaluating functions in the WHERE clause, even when code generation is not used in some phases of query execution.
- Using DECIMAL values in a GROUP BY clause triggers code generation optimization, which speeds up queries that group by values, such as prices, and improves performance and reliability of the DECIMAL data type.
- Improved coordination and parallelization between Impala nodes results in faster query startup time for queries that involve tables with many partitions or complex queries with many fragments.
- The coordinator node requires less memory, making it faster and less resource-intensive when performing joins that involve several tables with thousands of partitions.
- Impala only re-evaluates metadata for partitions that are affected by a DDL operation, not all partitions in the table. While a DDL or insert statement is in progress, other Impala statements that attempt to modify metadata for the same table wait until the first one finishes, improving performance and reliability of DDL and insert operations on partitioned tables with a large number of partitions.
- A new query option, OPTIMIZE_PARTITION_KEY_SCANS, speeds up aggregation operations that involve only the partition key columns of partitioned tables. This optimization can produce different results when files in a partition are manually deleted or are empty.
Security
- Column-Level Authorization
-
You can use column-level authorization to define access to particular columns within a table instead of the entire table. Creating views to set up authorization schemes for subsets of information is not required. This functionality requires Sentry 1.6
Column-level authorization has the following syntax:GRANT SELECT(column_name) ON TABLE table_name TO ROLE role_name; REVOKE SELECT(column_name) ON TABLE table_name FROM ROLE role_name;
- LDAP Password Retrieval
-
You can use a new impala-shell command line option,
--ldap_password_cmd
, to retrieve the LDAP password. The resulting password is used to authenticate the impala-shell command with the LDAP server.
Scripting Capability Improvements
Scripting capability improvements for the impala-shell enable you to define substitution
variables and use them in SQL statements that you execute through command-line options. The
--var
command-line option passes key-value pairs to the impala-shell. The
shell substitutes the values into an SQL statement where it contains the notation
${var:varname}
before Impala executes the query.
You can use the SET and UNSET commands in a session to define or clear substitution variables
with the SET|UNSET VAR:variable_name=value
notation or a script file processed
by the -f
option. You cannot define your own substitution variables through the
SET statement in a JDBC or ODBC application.
Dynamic Partition Pruning
Dynamic partition pruning is a technique that prevents Impala from reading data files from partitions that are not included in the result set. This occurs when a query references a partition key column in the WHERE clause and the column values are unknown until the query runs. Impala evaluates the predicate and skips the I/O for unnecessary partitions. This technique is useful for join queries that involve large partitioned tables.
You can control the level of dynamic partition pruning through the
RUNTIME_FILTER_MODE
query option. By default, this option is enabled and set
at medium level: RUNTIME_FILTER_MODE=LOCAL)
. The maximum setting uses more
memory for queries than in previous releases. You can set
RUNTIME_FILTER_MODE=GLOBAL
to fully enable dynamic partition pruning. Before
you run a query, check the EXPLAIN output to verify that partition pruning is applied.
Nested Loop Join Queries
Nested loop joins make additional non-equijoin queries possible and optimize queries that retrieve values from complex type columns. Some join queries that previously needed equality comparisons can use operators.
Live Progress Reporting
set live_progress=true|false;
set live_summary=true|false;
Runtime Filtering
Runtime filtering is technique where Impala determines the filter conditions as a query runs and broadcasts the information to the Impala nodes reading a table. This technique is useful for optimizing queries against partitioned tables or to evaluate join conditions when only partial table data is needed. The technique eliminates the I/O required to read all of the partitioned data, as well as unnecessary network traffic.
Option | Description | Type | Default |
RUNTIME_FILTER_MODE | Adjusts the settings, turns the feature on and off, and controls how extensively the filters are transmitted between hosts. | numeric (0, 1, 2) or corresponding mnemonic strings (OFF, LOCAL, GLOBAL) | 1 (same as LOCAL) |
MAX_NUM_RUNTIME_FILTERS | Sets an upper limit on the number of runtime filters produced for each query. | integer | 10 |
RUNTIME_BLOOM_FILTER_SIZE | Size (in bytes) of Bloom filter data structure used by the runtime filtering feature. | integer |
1048576 (1 MB) Max: 16 MB |
RUNTIME_FILTER_WAIT_TIME_MS | Maximum filter wait time in milliseconds. By default, each scan node waits for up to 1 second (1000 milliseconds) for filters to arrive. If all filters have not arrived within the specified interval, the scan node proceeds, using whatever filters did arrive to help avoid reading unnecessary data. If a filter arrives after the scan node begins reading data, the scan node applies that filter to the data that is read after the filter arrives, but not to the data that was already read. | integer | 1000 milliseconds |
DISABLE_ROW_RUNTIME_FILTERING |
Reduces the scope of the runtime filtering feature. Queries still dynamically prune partitions, but do not apply the filtering logic to individual rows within partitions. Only applies to queries against Parquet tables. For other file formats, Impala only prunes at the level of partitions, not individual rows. |
Boolean; recognized values are 1 and 0, or true and false; any other value interpreted as false |
false
|
Data Types
Impala now supports new complex data types that can encode multiple named fields, positional items, or key-value pairs within a single column. You can combine these types to produce nested types with arbitrarily deep nesting. Currently, complex data types are only supported for the Parquet file format.
Data Type | Syntax |
STRUCT |
column_name STRUCT < name : type [COMMENT 'comment_string'], ... > type ::= primitive_type | complex_type |
ARRAY |
column_name ARRAY < type > type ::= primitive_type | complex_type |
MAP |
column_name MAP < primitive_type, type > type ::= primitive_type | complex_type
|
Operators
Operator | Description | Syntax |
ILIKE | Improvements enable the ILIKE operator to perform case-insensitive wildcard matches or regular expression matches, rather than explicitly converting column values with UPPER or LOWER. |
string_expression ILIKE wildcard_expression string_expression NOT ILIKE wildcard_expression |
IREGEXPR | Improvements enable the IREGEXPR operator to perform case-insensitive wildcard matches or regular expression matches, rather than explicitly converting column values with UPPER or LOWER. | string_expression IREGEXP regular_expression |
IS [NOT] DISTINCT FROM |
Compares values for a true or false result, even if one or both values are NULL. The IS NOT DISTINCT FROM operator, or its equivalent <=> notation, improves the efficiency of join queries that treat key values that are NULL in both tables as equal.
|
expression1 IS DISTINCT FROM expression2 expression1 IS NOT DISTINCT FROM expression2 expression1 <=> expression2
|
Statements
Statement | Description | Syntax |
TRUNCATE TABLE | Removes all data from a table without removing the table itself. The statement accepts the IF EXISTS clause, making TRUNCATE TABLE easier to use in setup or ETL scripts where the table might or might not exist. |
TRUNCATE TABLE [IF EXISTS] [db_name.]table_name
|
CREATE TABLE AS SELECT | The CREATE TABLE AS SELECT statement accepts a PARTITIONED BY clause. You can create a partitioned table and insert data into the table with a single statement. |
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] db_name.]table_name [PARTITIONED BY (col_name[, ...])] [COMMENT 'table_comment'] [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)] [ [ROW FORMAT row_format] [STORED AS file_format] ] [LOCATION 'hdfs_path'] [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)] [CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED] AS select_statement |
SHOW DATABASES | The SHOW DATABASES statement returns two columns rather than one. The second column includes the associated comment string, if it exists, for each database. | SHOW DATABASES; |
DESCRIBE | The DESCRIBE statement displays metadata about a database. | DESCRIBE DATABASE db_name; |
DROP DATABASE | The DROP DATABASE statement works for a nonempty database. When you specify the optional CASCADE clause, any tables in the database are dropped before the database itself is removed. |
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT | CASCADE];
|
DROP TABLE/ALTER TABLE DROP PARTITION | The DROP TABLE, ALTER TABLE, and DROP PARTITION statements have an optional keyword, PURGE. PURGE causes Impala to immediately remove the relevant data files rather than sending them to the trashcan. This feature can help to avoid out-of-space errors on storage devices, and to avoid files being left behind in case of a problem with the trashcan. PURGE works when the trashcan is enabled. |
DROP TABLE [IF EXISTS] [db_name.]table_name [PURGE]
ALTER TABLE name { ADD [IF NOT EXISTS] | DROP [IF EXISTS] } PARTITION (partition_spec) [PURGE] |
Functions
- User-Defined Functions
- User-defined functions written in C++ persist when the catalog service is restarted. You no longer have to run the CREATE FUNCTION statements again after a restart. You must still reissue the CREATE FUNCTION statement for any Java-based user-defined functions. User-defined aggregate functions have more flexibility for intermediate data types.
- Analytic (Window) Functions
- The following table lists the new analytic functions:
Function Description Syntax PERCENT_RANK Calculates the rank, expressed as a percentage, of each row within a group of rows. If rank is the value for that same row from the RANK() function (from 1 to the total number of rows in the partition group), then the PERCENT_RANK() value is calculated as (rank - 1) / (rows_in_group - 1). If there is only a single item in the partition group, its PERCENT_RANK() value is 0. The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed. PERCENT_RANK (expr)
OVER ([partition_by_clause] order_by_clause)
NTILE Returns the "bucket number" associated with each row, between 1 and the value of an expression. For example, creating 100 buckets puts the lowest 1% of values in the first bucket, while creating 10 buckets puts the lowest 10% of values in the first bucket. Each partition can have a different number of buckets. The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed. NTILE (expr [, offset ...]
OVER ([partition_by_clause] order_by_clause)
CUME_DIST
Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 and less than or equal to 1.
The ORDER BY clause is required. The PARTITION BY clause is optional. The window clause is not allowed.
CUME_DIST (expr)
OVER ([partition_by_clause] order_by_clause)
- Math Functions
- The following table lists the new math functions:
Function Description Return Type cot(double a) Returns the cotangent of the argument. double factorial(integer_type a) Computes the factorial of an integer value and works with any integer type.
You can use either the factorial() function or the ! operator. The factorial of 0 is 1. The factorial() function returns 1 for any negative value. The maximum positive value for the input argument is 20; a value of 21 or greater overflows the range for a BIGINT and causes an error.
bigint radians(double a) Converts the argument value from degrees to radians. double - String Functions
- The following table lists the new string functions:
Function Description Return Type btrim(string a), btrim(string a, string chars_to_trim) Removes all instances of one or more characters from the start and end of a STRING value. By default, removes only spaces. If a non-NULL optional second argument is specified, the function removes all occurrences of characters in that second argument from the beginning and end of the string. string chr(int character_code) Returns a character specified by a decimal code point value. The interpretation and display of the resulting character depends on your system locale. Because consistent processing of Impala string values is only guaranteed for values within the ASCII range, only use this function for values corresponding to ASCII characters. In particular, parameter values greater than 255 return an empty string. string regexp_like(string source, string pattern[, string options])
Returns true or false to indicate whether the source string contains anywhere inside it the regular expression given by the pattern. The optional third argument consists of letter flags that change how the match is performed, such as i for case-insensitive matching. boolean split_part(string source, string delimiter, bigint n) Returns the nth field within a delimited string. The fields are numbered starting from 1. The delimiter can consist of multiple characters, not just a single character. All matching of the delimiter is done exactly, not using any regular expression patterns. string - Date/Time Functions
- The following table lists the new date and time functions:
Function Description Return Type int_months_between(timestamp newer, timestamp older) Returns the number of months between the date portions of two TIMESTAMP values, as an INT representing only the full months that passed. int months_between(timestamp newer, timestamp older) Returns the number of months between the date portions of two TIMESTAMP values. Can include a fractional part representing extra days in addition to the full months between the dates. The fractional component is computed by dividing the difference in days by 31 (regardless of the month). double timeofday() Returns a string representation of the current date and time, according to the time of the local system, including any time zone designation. string timestamp_cmp(timestamp t1, timestamp t2) Tests if one TIMESTAMP value is newer than, older than, or identical to another TIMESTAMP. int (either -1, 0, 1, or NULL) - Bit Manipulation Functions
- The following table lists the new bit manipulation functions:
Function Description Return Type bitand(integer_type a, same_type b) Returns an integer value representing the bits that are set to 1 in both of the arguments. If the arguments are of different sizes, the smaller is promoted to the type of the larger. Equivalent to the & binary operator. Same as the input value bitnot(integer_type a) Inverts all the bits of the input argument.
Equivalent to the ~ unary operator.
Same as the input value bitor(integer_type a, same_type b) Returns an integer value representing the bits that are set to 1 in either of the arguments. If the arguments are of different sizes, the smaller is promoted to the type of the larger. Equivalent to the | binary operator. Same as the input value bitxor(integer_type a, same_type b) Returns an integer value representing the bits that are set to 1 in one but not both of the arguments. If the arguments are of different sizes, the smaller is promoted to the type of the larger. Equivalent to the ^ binary operator. Same as the input value countset(integer_type a [, int zero_or_one]) By default, returns the number of 1 bits in the specified integer value. If the optional second argument is set to zero, it returns the number of 0 bits instead. Same as the input value getbit(integer_type a, int position) Returns a 0 or 1 representing the bit at a specified position. The positions are numbered right to left, starting at zero. The position argument cannot be negative. When you use a literal input value, it is treated as an 8-bit, 16-bit, and so on value, the smallest type that is appropriate. The type of the input value limits the range of the positions. Cast the input value to the appropriate type if you need to ensure it is treated as a 64-bit, 32-bit, and so on value. Same as the input value rotateleft(integer_type a, int positions) Rotates an integer value left by a specified number of bits. As the most significant bit is taken out of the original value, if it is a 1 bit, it is "rotated" back to the least significant bit. Therefore, the final value has the same number of 1 bits as the original value, just in different positions.
Specifying a second argument of zero leaves the original value unchanged. Rotating a -1 value by any number of positions still returns -1, because the original value has all 1 bits and all the 1 bits are preserved during rotation. Similarly, rotating a 0 value by any number of positions still returns 0. Rotating a value by the same number of bits as in the value returns the same value. Because this is a circular operation, the number of positions is not limited to the number of bits in the input value. For example, rotating an 8-bit value by 1, 9, 17, and so on positions returns an identical result in each case.
Same as the input value rotateright(integer_type a, int positions) Rotates an integer value right by a specified number of bits. As the least significant bit is taken out of the original value, if it is a 1 bit, it is "rotated" back to the most significant bit. Therefore, the final value has the same number of 1 bits as the original value, just in different positions. Specifying a second argument of zero leaves the original value unchanged. Rotating a -1 value by any number of positions still returns -1, because the original value has all 1 bits and all the 1 bits are preserved during rotation. Similarly, rotating a 0 value by any number of positions still returns 0. Rotating a value by the same number of bits as in the value returns the same value. Because this is a circular operation, the number of positions is not limited to the number of bits in the input value. For example, rotating an 8-bit value by 1, 9, 17, and so on positions returns an identical result in each case. Same as the input value setbit(integer_type a, int position [, int zero_or_one]) By default, changes a bit at a specified position to a 1, if it is not already. If the optional third argument is set to zero, the specified bit is set to 0 instead. If the bit at the specified position was already 1 (by default) or 0 (with a third argument of zero), the return value is the same as the first argument. The positions are numbered right to left, starting at zero. (Therefore, the return value could be different from the first argument even if the position argument is zero.) The position argument cannot be negative.
When you use a literal input value, it is treated as an 8-bit, 16-bit, and so on value, the smallest type that is appropriate. The type of the input value limits the range of the positions. Cast the input value to the appropriate type if you need to ensure it is treated as a 64-bit, 32-bit, and so on value.
Same as the input value shiftleft(integer_type a, int positions) Shifts an integer value left by a specified number of bits. As the most significant bit is taken out of the original value, it is discarded and the least significant bit becomes 0. The final value has either the same number of 1 bits as the original value, or fewer. Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces a result of zero. Specifying a second argument of zero leaves the original value unchanged. Shifting any value by 0 returns the original value. Shifting any value by 1 is the same as multiplying it by 2, as long as the value is small enough; larger values eventually become negative when shifted, as the sign bit is set. Starting with the value 1 and shifting it left by N positions gives the same result as 2 to the Nth power, or pow(2,N). Same as the input value shiftright(integer_type a, int positions) Shifts an integer value right by a specified number of bits. As the least significant bit is taken out of the original value, it is discarded and the most significant bit becomes 0. Therefore, the final value has either the same number of 1 bits as the original value, or fewer. Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces a result of zero. Specifying a second argument of zero leaves the original value unchanged. Shifting any value by 0 returns the original value. Shifting any positive value right by 1 is the same as dividing it by 2. Negative values become positive when shifted right.
Same as the input value - Miscellaneous Functions
- The following table lists a new miscellaneous function:
Function Description Syntax uuid() The uuid() function generates an alphanumeric value that you can use as a guaranteed unique identifier. The uniqueness applies across tables in cases where an ascending numeric sequence is not suitable. select uuid();
typeof(type value) A type conversion function that returns the name of the data type corresponding to an expression. For types with extra attributes, such as length for CHAR and VARCHAR, or precision and scale for DECIMAL, includes the full specification of the type. select typeof(type value);
Returns the type
For example, select typeof('xyz'); returns STRING
For example, select typeof(5.30001 / 2342.1); returns DECIMAL(13,11)