Spark 2.4.0.0-1904 (EEP 6.2.0) Release Notes

This section provides reference information, including new features, patches, and known issues for Spark 2.4.0.

The notes below relate specifically to the MapR Distribution for Apache Hadoop. This release of Spark includes changes that affect backward compatibility. For more information, see the open-source Spark 2.4.0.0 Release Notes.

These release notes contain only MapR-specific information and are not necessarily cumulative in nature. For information about how to use the release notes, see Ecosystem Component Release Notes.

Spark Version	2.4.0.0
Release Date	April 2019
MapR Version Interoperability	See EEP Components and OS Support.
Source on GitHub	https://github.com/mapr/spark
GitHub Release Tag	2.4.0.0-mapr-1904
Maven Artifacts	https://repository.mapr.com/maven/
Package Names	Navigate to https://package.ezmeral.hpe.com/releases/MEP/ and select your EEP and OS to view the list of package names.

IMPORTANT

Starting with EEP 6.0.0, keyStore and trustStore passwords can be removed from the spark-defaults.conf file and can be set in the /opt/mapr/conf/ssl-client.xml file.
Starting with EEP 6.0.0, after an upgrade, configuration files of previous versions are saved in the /opt/mapr/spark directory.
The MapR 6.1 and EEP 6.0.0 releases introduce "Simplified Security". If you are using these versions and enable security on your MapR cluster, MapR scripts automatically configure Spark security features.

Hive Support

This version of Spark supports integration with Hive. However, note the following exceptions:

Hive-on-Spark is not supported.
Spark-SQL is supported, but it is not fully compatible with Hive. For details, see the Apache Spark documentation and the MapR Spark documentation.

New in This Release

For a complete list of all new features, refer to the open source documentation.

Fixes

This MapR release includes the following new fixes since the latest MapR Spark 2.3.1 release. For details, refer to the commit log for this project in GitHub.

GitHub Commit	Date (YYYY-MM-DD)	Comment
4bdca6c	2019-02-25	MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr
0ccea10	2019-02-25	MapR [SPARK-434] Move absent commits from 2.3.2 branch
0af5795	2019-02-25	MapR [SPARK-442] Spark build fails because of the wrong tests in spark-streaming-kafka-10 module
d40b974	2019-02-25	MapR [SPARK-446] Spark configure.sh doesn't start/stop Spark services
d42400a	2019-02-25	MapR [SPARK-430] PID files should be under /opt/mapr/pid
b7eec10	2019-02-25	MapR [SPARK-221] Investigate possibility to move creating of the spark-env.sh from private-pkg to configure.sh
399d5b8	2019-02-25	MapR [SPARK-287] Move logic of creating /apps/spark folder from installer's scripts to the configure.sh
0170f29	2019-02-25	[SPARK-449] Kafka offset commit issue fixed
2497c80	2019-02-25	MapR [SPARK-417] impersonation fixes for spark executor. Impersonation is moved from HadoopRDD.compute() method to org.apache.spark.executor.Executor.run() method
e1d14ed	2019-02-25	MapR [SPARK-456] Spark shell can't be started
1cc194b	2019-02-26	[SPARK-466] SparkR errors fixed
9c4cf43	2019-02-26	[SPARK-379] Fix Spark version for Avro and Kubernetes integration tests
4436a8a	2019-02-26	MapR [SPARK-464] Can't submit spark 2.4 jobs from mapr-client
b14e1a6	2019-02-27	MapR [SPARK-465] Error messages after update of spark 2.4
c9fa510	2019-02-28	MapR [K8S-637][K8S] Add configure.sh configuration in spark-defaults.conf for job runtime
11e3daf	2019-02-28	MapR [SPARK-481] Cannot run spark configure.sh on Client node
4a740fb	2019-03-01	MapR [SPARK-486][K8S] Fix sasl encryption error on Kubernetes
30f88de	2019-03-07	MapR [SPARK-416] CVE-2018-1320 vulnerability in Apache Thrift
a3f0109	2019-03-08	MapR [SPARK-496] Spark HS UI doesn't work
f60e8a4	2019-03-08	MapR [SPARK-482] Spark streaming app fails to start by UnknownTopicOrPartitionException with checkpoint
71f5db9	2019-03-15	MapR [SPARK-514] Recovery from checkpoint is broken
ba9e107	2019-03-18	MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg
9fbdc61	2019-03-19	MapR [SPARK-515][K8S] Remove configure.sh call for k8s
cbbd78f	2019-03-19	MapR [SPARK-492] Spark 2.4.0.0 configure.sh has error messages
fce6079	2019-03-19	[SPARK-463] MAPR_MAVEN_REPO variable for specifying MapR repository
100aff7	2019-03-22	MapR [SPARK-494] Spark - Distribute Notice.txt across components starting with MEP 6.2
a4e4259	2019-03-25	MapR [SPARK-460] Spark Metrics for CollectD Configuration for collecting Spark metrics
7615273	2019-03-26	MapR [SPARK-510] nonmapr "admin" users not able to view other user logs in SHS
80edc50	2019-03-26	[SPARK-508] MapR-DB OJAI Connector for Spark isNull condition returns incorrect result
dfc0022	2019-03-28	MapR [SPARK-462] Spark and SparkHistoryServer allow weak ciphers, which can allow man in the middle attack
baf607e	2019-03-28	MapR [SPARK-461] Stop graph after jobs completion to prevent 'java.lang.IllegalStateException: No active subscriptions'
d48945f	2019-04-04	MapR [SPARK-516] Spark jobs failure using yarn mode on kerberos fixed
1c793f8	2019-04-11	MapR [SPARK-531] Remove duplicating entries from classpath in ClasspathFilter
6a39ff6	2019-04-11	[SPARK-444] Fix of hive version for spark dev branches
c5aeb67	2019-04-15	Spark 2.4.0 backport 2.4.1
2ae047f	2019-04-19	SPARK-539 Workaround for absent MapRDBJsonSplit class
94eb0f1	2019-04-20	K8S-853: Enable spark metrics for external tenant
c7abaf8	2019-04-22	MapR [SPARK-536] PySpark streaming package for kafka-0-10 added
ef70d34	2019-04-22	MapR [SPARK-540] Include 'avro' artifacts
f08108e	2019-04-23	MapR [K8S-893] Hide plain text password from logs
28ddfe9e	2019-05-17	MapR [SPARK-541] Avoid duplication of the first unexpired record

The following tickets are back-ported from Spark 2.4.1:

SPARK-26709 - OptimizeMetadataOnlyQuery does not correctly handle the files with zero record
SPARK-26080 - Unable to run worker.py on Windows
SPARK-26873 - FileFormatWriter creates inconsistent MR job IDs
SPARK-26745 - Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
SPARK-26677 - Incorrect results of not(eqNullSafe) when data read from Parquet file
SPARK-26708 - Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
SPARK-26267 - Kafka source may reprocess data
SPARK-26706 - Fix Cast$mayTruncate for bytes
SPARK-26078 - WHERE .. IN fails to filter rows when used in combination with UNION
SPARK-26233 - Incorrect decimal value with java beans and first/last/max... functions
SPARK-27097 - Avoid embedding platform-dependent offsets literally in whole-stage generated code
SPARK-26188 - Spark 2.4.0 Partitioning behavior breaks backwards compatibility
SPARK-25921 - Python worker reuse causes Barrier tasks to run without BarrierTaskContext

Known Issues

pyspark.sql.utils.AnalysisException - The Python OJAI connector fails because the Spark SQL parser incorrectly resolves calls to Python user-defined functions.
When the same SQL expression appears in both the SELECT clause and the GROUP BY clause, the two occurrences resolve to different expression IDs.
The following sample query triggers pyspark.sql.utils.AnalysisException. The stringtodate1(yelping_since) expression is used in both SELECT and GROUP BY, and stringtodate1 is a Python user-defined function:
```
SELECT business_id, stringtodate1(yelping_since) AS startyear, avg(stars) AS avgstars FROM temp_table_name GROUP BY business_id, stringtodate1(yelping_since)
```
Workaround: Replace the stringtodate1(yelping_since) expression in the GROUP BY clause with its alias, startyear:
```
SELECT business_id, stringtodate1(yelping_since) AS startyear, avg(stars) AS avgstars FROM temp_table_name GROUP BY business_id, startyear
```
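For context, the workaround might be applied from PySpark roughly as follows. This is a minimal sketch, not the connector's own code: the body of stringtodate1 (here string_to_year), the sample rows, and the run_workaround helper are all assumptions made for illustration, and the data is loaded with createDataFrame rather than through the OJAI connector. Only the query text and the names stringtodate1, temp_table_name, and startyear come from the notes above.

```python
# Sketch only: the body of stringtodate1 and the sample rows are assumptions;
# the release notes name the function but do not show its implementation.

WORKAROUND_QUERY = (
    "SELECT business_id, stringtodate1(yelping_since) AS startyear, "
    "avg(stars) AS avgstars "
    "FROM temp_table_name "
    "GROUP BY business_id, startyear"  # alias, not a second UDF call
)


def string_to_year(value):
    """Hypothetical UDF body: return the year of a 'YYYY-MM-DD' string."""
    return int(value[:4]) if value else None


def run_workaround(spark):
    """Register the UDF, load sample rows, and run the workaround query.

    `spark` is an existing pyspark.sql.SparkSession.
    """
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql.types import IntegerType

    spark.udf.register("stringtodate1", string_to_year, IntegerType())
    rows = [("b1", "2012-05-01", 4.0), ("b1", "2012-09-13", 2.0)]
    df = spark.createDataFrame(rows, ["business_id", "yelping_since", "stars"])
    df.createOrReplaceTempView("temp_table_name")
    return spark.sql(WORKAROUND_QUERY)
```

Calling run_workaround(spark).show() in a pyspark shell would display the grouped result. The point of the sketch is that the UDF call appears once, in the SELECT list, and the GROUP BY clause refers to it only through the startyear alias.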

Resolved Issues

None.