Spark 2.4.0.0-1904 (EEP 6.2.0) Release Notes

This section provides reference information, including new features, patches, and known issues for Spark 2.4.0.

The notes below relate specifically to the MapR Distribution for Apache Hadoop. This release of Spark includes backward-compatibility changes; see the open-source Spark 2.4.0.0 Release Notes for more information.

These release notes contain only MapR-specific information and are not necessarily cumulative in nature. For information about how to use the release notes, see Ecosystem Component Release Notes.

Spark Version: 2.4.0.0
Release Date: April 2019
MapR Version Interoperability: See EEP Components and OS Support.
Source on GitHub: https://github.com/mapr/spark
GitHub Release Tag: 2.4.0.0-mapr-1904
Maven Artifacts: https://repository.mapr.com/maven/
Package Names: Navigate to https://package.ezmeral.hpe.com/releases/MEP/ and select your EEP and OS to view the list of package names.
IMPORTANT
  • Starting with EEP 6.0.0, keyStore and trustStore passwords can be removed from the spark-defaults.conf file and can be set in the /opt/mapr/conf/ssl-client.xml file.
  • Starting with EEP 6.0.0, after an upgrade, configuration files of previous versions are saved in the /opt/mapr/spark directory.
  • The MapR 6.1 and EEP 6.0.0 releases introduce "Simplified Security". If you use these versions and enable security on your MapR cluster, MapR scripts automatically configure Spark security features.
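
As a rough aid for the first point above, the following Python sketch scans the text of a spark-defaults.conf file for password settings that could be moved to /opt/mapr/conf/ssl-client.xml. It assumes the standard Spark SSL property names (spark.ssl.keyStorePassword, spark.ssl.trustStorePassword, spark.ssl.keyPassword, optionally with a namespace such as spark.ssl.ui). The function name find_password_entries and the sample content are hypothetical and are not part of any MapR tooling.

```python
import re

# Standard Spark SSL settings that carry passwords. An optional namespace
# segment is allowed, for example spark.ssl.ui.keyStorePassword.
PASSWORD_KEY = re.compile(
    r"^spark\.ssl\.(?:\w+\.)?(?:keyStorePassword|trustStorePassword|keyPassword)$"
)


def find_password_entries(conf_text):
    """Return (line_number, key) pairs for password settings in conf_text."""
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        # Assumption: key and value are separated by whitespace or "=".
        key = re.split(r"[\s=]+", stripped, maxsplit=1)[0]
        if PASSWORD_KEY.match(key):
            hits.append((number, key))
    return hits


# Hypothetical file content, used only for illustration.
sample = """\
# hypothetical spark-defaults.conf content
spark.ssl.enabled true
spark.ssl.keyStorePassword secret1
spark.ssl.trustStorePassword=secret2
"""

print(find_password_entries(sample))
```

The function works on text rather than a fixed path because the location of spark-defaults.conf depends on the installed Spark version.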

Hive Support

This version of Spark supports integration with Hive. However, note the following exceptions:

New in This Release

Fixes

This MapR release includes the following new fixes since the latest MapR Spark 2.3.1 release. For details, refer to the commit log for this project in GitHub.

GitHub Commit Date (YYYY-MM-DD) Comment
4bdca6c 2019-02-25 MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr
0ccea10 2019-02-25 MapR [SPARK-434] Move absent commits from 2.3.2 branch
0af5795 2019-02-25 MapR [SPARK-442] Spark build fails because of the wrong tests in spark-streaming-kafka-10 module
d40b974 2019-02-25 MapR [SPARK-446] Spark configure.sh doesn't start/stop Spark services
d42400a 2019-02-25 MapR [SPARK-430] PID files should be under /opt/mapr/pid
b7eec10 2019-02-25 MapR [SPARK-221] Investigate possibility to move creating of the spark-env.sh from private-pkg to configure.sh
399d5b8 2019-02-25 MapR [SPARK-287] Move logic of creating /apps/spark folder from installer's scripts to the configure.sh
0170f29 2019-02-25 [SPARK-449] Kafka offset commit issue fixed
2497c80 2019-02-25 MapR [SPARK-417] impersonation fixes for spark executor. Impersonation is moved from HadoopRDD.compute() method to org.apache.spark.executor.Executor.run() method
e1d14ed 2019-02-25 MapR [SPARK-456] Spark shell can't be started
1cc194b 2019-02-26 [SPARK-466] SparkR errors fixed
9c4cf43 2019-02-26 [SPARK-379] Fix Spark version for Avro and Kubernetes integration tests
4436a8a 2019-02-26 MapR [SPARK-464] Can't submit spark 2.4 jobs from mapr-client
b14e1a6 2019-02-27 MapR [SPARK-465] Error messages after update of spark 2.4
c9fa510 2019-02-28 MapR [K8S-637][K8S] Add configure.sh configuration in spark-defaults.conf for job runtime
11e3daf 2019-02-28 MapR [SPARK-481] Cannot run spark configure.sh on Client node
4a740fb 2019-03-01 MapR [SPARK-486][K8S] Fix sasl encryption error on Kubernetes
30f88de 2019-03-07 MapR [SPARK-416] CVE-2018-1320 vulnerability in Apache Thrift
a3f0109 2019-03-08 MapR [SPARK-496] Spark HS UI doesn't work
f60e8a4 2019-03-08 MapR [SPARK-482] Spark streaming app fails to start by UnknownTopicOrPartitionException with checkpoint
71f5db9 2019-03-15 MapR [SPARK-514] Recovery from checkpoint is broken
ba9e107 2019-03-18 MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg
9fbdc61 2019-03-19 MapR [SPARK-515][K8S] Remove configure.sh call for k8s
cbbd78f 2019-03-19 MapR [SPARK-492] Spark 2.4.0.0 configure.sh has error messages
fce6079 2019-03-19 [SPARK-463] MAPR_MAVEN_REPO variable for specifying mapR repository
100aff7 2019-03-22 MapR [SPARK-494] Spark - Distribute Notice.txt across components starting with MEP 6.2
a4e4259 2019-03-25 MapR [SPARK-460] Spark Metrics for CollectD Configuration for collecting Spark metrics
7615273 2019-03-26 MapR [SPARK-510] nonmapr "admin" users not able to view other user logs in SHS
80edc50 2019-03-26 [SPARK-508] MapR-DB OJAI Connector for Spark isNull condition returns incorrect result
dfc0022 2019-03-28 MapR [SPARK-462] Spark and SparkHistoryServer allow weak ciphers, which can allow man in the middle attack
baf607e 2019-03-28 MapR [SPARK-461] Stop graph after jobs completion to prevent 'java.lang.IllegalStateException: No active subscriptions'
d48945f 2019-04-04 MapR [SPARK-516] Spark jobs failure using yarn mode on kerberos fixed
1c793f8 2019-04-11 MapR [SPARK-531] Remove duplicating entries from classpath in ClasspathFilter
6a39ff6 2019-04-11 [SPARK-444] Fix of hive version for spark dev branches
c5aeb67 2019-04-15 Spark 2.4.0 backport 2.4.1
2ae047f 2019-04-19 SPARK-539 Workaround for absent MapRDBJsonSplit class
94eb0f1 2019-04-20 K8S-853: Enable spark metrics for external tenant
c7abaf8 2019-04-22 MapR [SPARK-536] PySpark streaming package for kafka-0-10 added
ef70d34 2019-04-22 MapR [SPARK-540] Include 'avro' artifacts
f08108e 2019-04-23 MapR [K8S-893] Hide plain text password from logs
28ddfe9e 2019-05-17 MapR [SPARK-541] Avoid duplication of the first unexpired record
The following tickets are back-ported from Spark 2.4.1:
  • SPARK-26709 - OptimizeMetadataOnlyQuery does not correctly handle the files with zero record
  • SPARK-26080 - Unable to run worker.py on Windows
  • SPARK-26873 - FileFormatWriter creates inconsistent MR job IDs
  • SPARK-26745 - Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
  • SPARK-26677 - Incorrect results of not(eqNullSafe) when data read from Parquet file
  • SPARK-26708 - Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan
  • SPARK-26267 - Kafka source may reprocess data
  • SPARK-26706 - Fix Cast$mayTruncate for bytes
  • SPARK-26078 - WHERE .. IN fails to filter rows when used in combination with UNION
  • SPARK-26233 - Incorrect decimal value with java beans and first/last/max... functions
  • SPARK-27097 - Avoid embedding platform-dependent offsets literally in whole-stage generated code
  • SPARK-26188 - Spark 2.4.0 Partitioning behavior breaks backwards compatibility
  • SPARK-25921 - Python worker reuse causes Barrier tasks to run without BarrierTaskContext

Known Issues

  • pyspark.sql.utils.AnalysisException - The Python OJAI connector fails because the Spark SQL parser incorrectly resolves calls to Python user-defined functions.

    The same SQL expression in the SELECT clause and in the GROUP BY clause resolves to different expression IDs.

    The following sample SQL query leads to pyspark.sql.utils.AnalysisException. The stringtodate1(yelping_since) expression is used in both SELECT and GROUP BY, and stringtodate1 is a Python user-defined function:

    SELECT business_id, stringtodate1(yelping_since) AS startyear, avg(stars) AS avgstars FROM temp_table_name GROUP BY business_id, stringtodate1(yelping_since)
    Workaround: Replace the stringtodate1(yelping_since) expression in GROUP BY with the alias startyear:
    SELECT business_id, stringtodate1(yelping_since) AS startyear, avg(stars) AS avgstars FROM temp_table_name GROUP BY business_id, startyear
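
    The workaround relies on both query forms grouping rows identically. The Python sketch below checks that equivalence using the standard-library sqlite3 module as a stand-in for Spark SQL. It does not reproduce the AnalysisException, which is specific to Spark. The body of stringtodate1 (extracting the year from a date string) and the sample rows are assumptions made for illustration only.

```python
import sqlite3


def stringtodate1(value):
    """Hypothetical UDF body: extract the year from a 'YYYY-MM-DD' string."""
    return int(value[:4])


# Form from the known issue: the UDF call is repeated in GROUP BY.
ORIGINAL_FORM = (
    "SELECT business_id, stringtodate1(yelping_since) AS startyear, "
    "avg(stars) AS avgstars FROM temp_table_name "
    "GROUP BY business_id, stringtodate1(yelping_since)"
)

# Workaround form: GROUP BY refers to the alias instead.
WORKAROUND_FORM = (
    "SELECT business_id, stringtodate1(yelping_since) AS startyear, "
    "avg(stars) AS avgstars FROM temp_table_name "
    "GROUP BY business_id, startyear"
)

# Hypothetical sample rows: (business_id, yelping_since, stars).
ROWS = [
    ("b1", "2015-03-01", 4.0),
    ("b1", "2015-07-09", 2.0),
    ("b1", "2016-01-15", 5.0),
    ("b2", "2015-02-02", 3.0),
]


def run(query):
    """Run the query against an in-memory SQLite table and return sorted rows."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("stringtodate1", 1, stringtodate1)
    conn.execute(
        "CREATE TABLE temp_table_name "
        "(business_id TEXT, yelping_since TEXT, stars REAL)"
    )
    conn.executemany("INSERT INTO temp_table_name VALUES (?, ?, ?)", ROWS)
    rows = sorted(conn.execute(query).fetchall())
    conn.close()
    return rows


# Both forms should return the same grouped rows.
print(run(ORIGINAL_FORM) == run(WORKAROUND_FORM))
```

    In PySpark, the workaround form would be passed to spark.sql() after registering the user-defined function and the temporary view.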

Resolved Issues

  • None.