Apache Spark Feature Support

HPE Ezmeral Data Fabric supports most Apache Spark features. However, there are some exceptions.

GPU Aware Scheduling Support on Spark

HPE Ezmeral Data Fabric does not support the GPU-aware scheduling feature on Spark 3.1.2 and Spark 3.2.0. GPU-aware scheduling on Spark 3 requires Hadoop 3.x. However, EEP 8.x.x supports Hadoop 2.7.6, so YARN cannot allocate GPU resources to Spark applications.
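The version dependency described above can be expressed as a small check. The following Python sketch is illustrative only: the function name yarn_can_schedule_gpus is hypothetical, and it assumes the Hadoop version is available as a plain major.minor.patch string.

```python
def yarn_can_schedule_gpus(hadoop_version: str) -> bool:
    """Return True when the Hadoop major version is 3 or later.

    Hypothetical helper: it encodes only the rule stated in this
    section (GPU aware scheduling on Spark 3 needs Hadoop 3.x).
    """
    major = int(hadoop_version.split(".")[0])
    return major >= 3


if __name__ == "__main__":
    # EEP 8.x.x supports Hadoop 2.7.6, so the requirement is not met.
    print(yarn_can_schedule_gpus("2.7.6"))
```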

Delta Lake Support on Spark

Starting from EEP 8.x.x, Apache Spark 3 provides Delta Lake support on HPE Ezmeral Data Fabric.

Delta Lake is an open-source storage layer that supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions to provide reliability, consistency, and scalability to Apache Spark applications. Delta Lake runs on top of existing storage and is compatible with Apache Spark APIs. For more details, see the Delta Lake documentation.

You can use any Apache Spark API to read and write data with Delta Lake. Delta Lake stores data as versioned Parquet files. Its well-defined open protocol, the Delta Transaction Protocol, provides ACID transactions to Apache Spark applications.
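As a minimal sketch of reading and writing through the standard DataFrame API, the following Python function writes a small DataFrame in Delta format and reads it back. It assumes that the caller passes a SparkSession that already has Delta Lake enabled as described in this section; the function name and the table path argument are hypothetical.

```python
def write_then_read_delta(spark, table_path):
    """Write a small DataFrame as a Delta table, then read it back.

    Assumptions: `spark` is a SparkSession with the Delta Lake library
    on its classpath, and `table_path` is a location you can write to.
    """
    # spark.range creates a DataFrame with a single column named "id".
    df = spark.range(0, 5)
    # "delta" is the data source name registered by the Delta Lake library.
    df.write.format("delta").mode("overwrite").save(table_path)
    # Reading uses the same format name and the same path.
    return spark.read.format("delta").load(table_path)
```

Calling the function a second time with the same path overwrites the table and, because every successful write is a commit, adds a new version to the table history.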

To enable Delta Lake:

Download the Delta Lake library from the Maven repository.

Add the Delta Lake library and set the following configuration options when you start Spark. For example:

/opt/mapr/spark/spark-3.1.2/bin/spark-shell --jars ~/delta-core_2.12-1.0.0.jar \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
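The same two settings can also be applied programmatically when an application creates its own session. The following is a sketch in Python, assuming PySpark is installed; the names DELTA_SPARK_CONF and build_delta_session, and the default application name, are hypothetical.

```python
# The two settings from the spark-shell example, as key/value pairs.
DELTA_SPARK_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(delta_jar_path, app_name="delta-example"):
    """Create a SparkSession with the Delta Lake jar and settings applied.

    Assumptions: PySpark is installed and delta_jar_path points to the
    downloaded Delta Lake library. app_name is an arbitrary label.
    """
    # Imported here so the module still loads where PySpark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    # spark.jars is the configuration equivalent of the --jars option.
    builder = builder.config("spark.jars", delta_jar_path)
    for key, value in DELTA_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```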

Delta Lake records the commit of every successful transaction (Spark job) in the DeltaLog, also called the Delta Lake transaction log.

For example, you can view these commit logs in the MinIO Browser by navigating to /<table_name>/_delta_log/.

Commits in the transaction log:

/<table_name>/_delta_log/00000000000000000000.json
/<table_name>/_delta_log/00000000000000000001.json
/<table_name>/_delta_log/00000000000000000002.json
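Each commit file name is the table version written as a 20-digit, zero-padded number. The following Python sketch shows how such names map to version numbers; the helper names are hypothetical, and the code only parses names, it does not read the log contents.

```python
import re

# A commit file name is the table version, zero-padded to 20 digits.
_COMMIT_NAME = re.compile(r"^(\d{20})\.json$")


def commit_file_name(version):
    """Return the commit file name for a given table version."""
    return f"{version:020d}.json"


def commit_versions(file_names):
    """Return the sorted versions found among _delta_log file names."""
    versions = []
    for name in file_names:
        match = _COMMIT_NAME.match(name)
        if match:  # skip any file that is not a commit file
            versions.append(int(match.group(1)))
    return sorted(versions)
```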

Delta Lake uses optimistic concurrency control to provide ACID transactions between write operations. See Concurrency Control.
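The idea behind optimistic concurrency control can be illustrated with a toy model: each writer notes the version it read, tries to create the next commit file, and retries if another writer got there first. The following Python sketch is a simplified illustration on a local directory, not Delta Lake's implementation; the function names are hypothetical, and it omits the check Delta Lake performs to decide whether a competing commit really conflicts.

```python
import os
import tempfile


def latest_version(log_dir):
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(n[:-5]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def try_commit(log_dir, read_version, payload):
    """Attempt to commit as version read_version + 1.

    Returns the committed version, or None if another writer already
    created that version (the caller must re-read and retry).
    """
    target = os.path.join(log_dir, f"{read_version + 1:020d}.json")
    try:
        # Mode "x" fails if the file exists, so only one writer can win.
        with open(target, "x") as handle:
            handle.write(payload)
    except FileExistsError:
        return None
    return read_version + 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        seen = latest_version(log_dir)           # both writers read the same state
        print(try_commit(log_dir, seen, "{}"))   # first writer succeeds
        print(try_commit(log_dir, seen, "{}"))   # second writer detects a conflict
        # The second writer re-reads the log and retries on the new version.
        print(try_commit(log_dir, latest_version(log_dir), "{}"))
```

The approach is called optimistic because writers do not take a lock before they start; they assume that conflicts are rare and pay the cost of a retry only when one occurs.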

See Setup Apache Spark with Delta Lake and Advanced Dependency Management to start using Delta Lake.

Spark SQL and Apache Derby Support on Spark

If you use Spark SQL with a Derby database and no Hive or Hive Metastore installation, the following exception occurs:

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

To use Spark SQL with a Derby database without a Hive or Hive Metastore installation, add the hive-service-2.3.*.jar and log4j2 jars to the /opt/mapr/spark/spark-3.x.x/jars directory.

The log4j2 jars are located at /opt/mapr/lib/log4j2/log4j-*.jar.
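The copy step can be scripted. The following Python sketch copies every jar that matches a set of glob patterns into a target directory. It is only an illustration: this section does not state where hive-service-2.3.*.jar is located, so the script reads that directory from a hypothetical HIVE_SERVICE_JAR_DIR environment variable, and it reads the Spark jars directory for your Spark version from a hypothetical SPARK_JARS_DIR variable.

```python
import glob
import os
import shutil


def copy_jars(patterns, spark_jars_dir):
    """Copy every jar matching the glob patterns into spark_jars_dir.

    Returns the sorted base names of the files that were copied.
    """
    copied = []
    for pattern in patterns:
        for jar in glob.glob(pattern):
            shutil.copy(jar, spark_jars_dir)
            copied.append(os.path.basename(jar))
    return sorted(copied)


if __name__ == "__main__":
    # Assumption: both environment variables are set by the person
    # running the script; neither is defined by the product.
    patterns = [
        os.path.join(os.environ["HIVE_SERVICE_JAR_DIR"], "hive-service-2.3.*.jar"),
        "/opt/mapr/lib/log4j2/log4j-*.jar",
    ]
    print(copy_jars(patterns, os.environ["SPARK_JARS_DIR"]))
```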

Spark 3.1.2 and Spark 3.2.0 do not support log4j1.2 logging on HPE Ezmeral Data Fabric.

Spark Thrift JDBC/ODBC Server Support

Running the Spark Thrift JDBC/ODBC Server on a secure cluster is supported only on Spark 2.1.0 or later.

You can run the Spark Thrift JDBC/ODBC Server to enable connections to Hive 1.2.1 using Beeline; however, you can connect only to Hive versions supported by your Spark version.
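A Beeline connection to the Thrift server is made through a JDBC URL. The following Python sketch assembles the Beeline argument list; the function name is hypothetical, the host, port, and user are placeholders for your cluster, and the additional URL parameters that a secure cluster typically requires are site-specific and not shown.

```python
def beeline_command(host, port, user=None):
    """Build the argument list for a Beeline connection.

    Assumption: host, port, and user are supplied by the caller; no
    defaults are implied for any particular cluster.
    """
    # The Spark Thrift server speaks the HiveServer2 protocol, so the
    # JDBC URL uses the jdbc:hive2 scheme.
    command = ["beeline", "-u", f"jdbc:hive2://{host}:{port}"]
    if user is not None:
        command += ["-n", user]  # -n passes the user name
    return command
```

The returned list can be passed to subprocess.run, or joined with spaces to produce the command line to type in a terminal.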

Spark SQL and Hive Support for Spark 2.1.0

Spark 2.1.0 can connect to the Hive 2.1 Metastore; however, only Hive 1.2 features are supported.

Spark SQL and Hive Support for Spark 2.0.1

Spark SQL is supported, but it is not fully compatible with Hive. For details, see the Apache Spark documentation.

The following Hive features are not supported in Spark SQL:

Tables with buckets
UNION type
Unique join
Column statistics collecting
Output formats: File format (for CLI), Hadoop Archive
Block-level bitmap indexes and virtual columns
Automatic determination of the number of reducers for JOIN and GROUP BY
Metadata-only query
Skew data flag
STREAMTABLE hint in JOIN
Merging of multiple small files for query results

Spark SQL and Hive Support for Spark 1.6.1

Spark SQL is supported, but it is not fully compatible with Hive. For details, see the Apache Spark documentation. The following table shows which Hive 1.2 table formats each Spark SQL operation supports:

	Hive 1.2 Table Format
Spark SQL Operations	AVRO	ORC	Parquet	RC	default
create	Yes	Yes	Yes	Yes	Yes
drop	Yes	Yes	Yes	Yes	Yes
insert into	Yes	Yes	Yes	Yes	Yes
insert overwrite	Yes	Yes	Yes	Yes	Yes
select	Yes	Yes	Yes	Yes	Yes
load data	Yes	Yes	Yes	Yes	Yes