Spark 2.0.1-1611 API Changes

This topic describes the public API changes that occurred between Apache Spark 1.6.1 and Spark 2.0.1.

Removed Methods

The following items have been removed from Apache Spark 2.0.1:
  • Bagel (the Spark implementation of Google Pregel)
  • Most of the deprecated methods from Spark 1.x, including the items in the following table (a short DataFrame write migration sketch follows this list):
| Category | Subcategory | Instead of this removed API... | Use... |
|----------|-------------|--------------------------------|--------|
| GraphX | | mapReduceTriplets | aggregateMessages |
| GraphX | | runSVDPlusPlus | run |
| GraphX | | GraphKryoRegistrator | |
| SQL | DataType | DataType.fromCaseClassString | DataType.fromJson |
| SQL | DecimalType | DecimalType() | DecimalType(precision, scale) to provide precision explicitly |
| SQL | DecimalType | DecimalType(Option[PrecisionInfo]) | DecimalType(precision, scale) |
| SQL | DecimalType | PrecisionInfo | DecimalType(precision, scale) |
| SQL | DecimalType | precisionInfo | precision and scale |
| SQL | DecimalType | Unlimited | (No longer supported) |
| SQL | Column | Column.in() | isin() |
| SQL | DataFrame | toSchemaRDD | toDF |
| SQL | DataFrame | createJDBCTable | write.jdbc() |
| SQL | DataFrame | saveAsParquetFile | write.parquet() |
| SQL | DataFrame | saveAsTable | write.saveAsTable() |
| SQL | DataFrame | save | write.save() |
| SQL | DataFrame | insertInto | write.mode(SaveMode.Append).saveAsTable() |
| SQL | DataFrameReader | DataFrameReader.load(path) | option("path", path).load() |
| SQL | Functions | cumeDist | cume_dist |
| SQL | Functions | denseRank | dense_rank |
| SQL | Functions | percentRank | percent_rank |
| SQL | Functions | rowNumber | row_number |
| SQL | Functions | inputFileName | input_file_name |
| SQL | Functions | isNaN | isnan |
| SQL | Functions | sparkPartitionId | spark_partition_id |
| SQL | Functions | callUDF | udf |
| Core | SparkContext | Constructors no longer take the preferredNodeLocationData parameter | |
| Core | SparkContext | tachyonFolderName | externalBlockStoreFolderName |
| Core | SparkContext | initLocalProperties, clearFiles, clearJars | (No longer needed) |
| Core | SparkContext | The runJob method no longer takes the allowLocal parameter | |
| Core | SparkContext | defaultMinSplits | defaultMinPartitions |
| Core | SparkContext | [Double, Int, Long, Float]AccumulatorParam | implicit objects from AccumulatorParam |
| Core | SparkContext | rddTo[Pair, Async, Sequence, Ordered]RDDFunctions | implicit functions from RDD |
| Core | SparkContext | [double, numeric]RDDToDoubleRDDFunctions | implicit functions from RDD |
| Core | SparkContext | intToIntWritable, longToLongWritable, floatToFloatWritable, doubleToDoubleWritable, boolToBoolWritable, bytesToBytesWritable, stringToText | implicit functions from WritableFactory |
| Core | SparkContext | [int, long, double, float, boolean, bytes, string, writable]WritableConverter | implicit functions from WritableConverter |
| Core | TaskContext | runningLocally | isRunningLocally |
| Core | TaskContext | addOnCompleteCallback | addTaskCompletionListener |
| Core | TaskContext | attemptId | attemptNumber |
| Core | JavaRDDLike | splits | partitions |
| Core | JavaRDDLike | toArray | collect |
| Core | JavaSparkContext | defaultMinSplits | defaultMinPartitions |
| Core | JavaSparkContext | clearJars, clearFiles | (No longer needed) |
| Core | PairRDDFunctions | PairRDDFunctions.reduceByKeyToDriver | reduceByKeyLocally |
| Core | RDD | mapPartitionsWithContext | TaskContext.get |
| Core | RDD | mapPartitionsWithSplit | mapPartitionsWithIndex |
| Core | RDD | mapWith | mapPartitionsWithIndex |
| Core | RDD | flatMapWith | mapPartitionsWithIndex and flatMap |
| Core | RDD | foreachWith | mapPartitionsWithIndex and foreach |
| Core | RDD | filterWith | mapPartitionsWithIndex and filter |
| Core | RDD | toArray | collect |
| Core | TaskInfo | TaskInfo.attempt | TaskInfo.attemptNumber |
| Guava Optional | | Guava Optional | org.apache.spark.api.java.Optional |
| Vector | | Vector, VectorSuite | |
| Configuration options and params | | --name | |
| Configuration options and params | | --driver-memory | spark.driver.memory |
| Configuration options and params | | --driver-cores | spark.driver.cores |
| Configuration options and params | | --executor-memory | spark.executor.memory |
| Configuration options and params | | --executor-cores | spark.executor.cores |
| Configuration options and params | | --queue | spark.yarn.queue |
| Configuration options and params | | --files | spark.yarn.dist.files |
| Configuration options and params | | --archives | spark.yarn.dist.archives |
| Configuration options and params | | --addJars | spark.yarn.dist.jars |
| Configuration options and params | | --py-files | spark.submit.pyFiles |
Note also the following additional removals and changes:
  • Python DataFrame methods that returned an RDD have been moved to dataframe.rdd. For example, df.map is now df.rdd.map.
  • Some streaming connectors (Twitter, Akka, MQTT, and ZeroMQ) have been removed.
  • org.apache.spark.shuffle.hash.HashShuffleManager no longer exists. SortShuffleManager has been the default since Spark 1.2.
  • DataFrame is no longer a separate class. It is now a type alias for Dataset[Row].
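
For the removed DataFrame write methods listed in the table above (saveAsParquetFile, insertInto, and related calls), the following is a minimal Scala sketch of the migration to the DataFrameWriter API. The application name, input path, output path, and table name are hypothetical placeholders.

```scala
// Minimal migration sketch; paths, table name, and app name are placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriterMigrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("writer-migration-sketch")
      .master("local[*]") // for local testing only; omit when using spark-submit
      .getOrCreate()

    val df = spark.read.json("/tmp/events.json") // hypothetical input path

    // Spark 1.6.1 (removed): df.saveAsParquetFile("/tmp/events.parquet")
    df.write.parquet("/tmp/events.parquet")

    // Spark 1.6.1 (removed): df.insertInto("events")
    df.write.mode(SaveMode.Append).saveAsTable("events")

    spark.stop()
  }
}
```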

Behavior Changes

Spark 2.0.1 implements the following behavior changes:

  • Spark 2.0.1 uses Scala 2.11 instead of 2.10.
  • Floating-point literals in SQL are now parsed as decimal type instead of double type (see the example after this list).
  • The Kryo version is now 3.0.
  • The Jersey version is now 2.
  • The Java RDD flatMap and mapPartitions methods now require functions that return a Java Iterator instead of an Iterable.
  • The Java RDD countByKey and countApproxDistinctByKey methods now return Map[K, Long] instead of Map[K, Object].
  • When writing Parquet files, summary files are no longer written by default (set parquet.enable.summary-metadata to true to re-enable them).
  • MLlib changed substantially. Follow the Apache Spark Migration Guide for details.
  • SparkContext.emptyRDD now returns an RDD instead of an EmptyRDD.
  • The Spark Standalone Master no longer serves job history.
  • org.apache.spark.api.java.JavaPairRDD methods were changed:
    • countByKey and countApproxDistinctByKey now return java.lang.Long instead of scala.Long.
    • sampleByKey and sampleByKeyExact now return java.lang.Double instead of scala.Double.
  • The old application history format, which created a folder for each application, has been removed.
  • org.apache.spark.Logging is now private. Use SLF4J directly instead.
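
As a quick illustration of the literal-parsing change noted in the list above, the following Scala sketch (hypothetical application name; local master used only for illustration) prints the schema Spark infers for an unqualified floating-point literal and shows an explicit cast that restores the previous double behavior.

```scala
// Sketch of the SQL literal-parsing change; app name and master are placeholders.
import org.apache.spark.sql.SparkSession

object LiteralParsingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("literal-parsing-sketch")
      .master("local[*]") // for local testing only
      .getOrCreate()

    // Spark 1.6.1 parsed 1.5 as double; Spark 2.0.1 parses it as a decimal.
    spark.sql("SELECT 1.5 AS x").printSchema()

    // Cast explicitly to keep the previous double behavior.
    spark.sql("SELECT CAST(1.5 AS DOUBLE) AS x").printSchema()

    spark.stop()
  }
}
```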

Other Deprecated Items

  • Java 7 is now deprecated.
  • Python 2.6 is now deprecated.
  • TaskContext.isRunningLocally now always returns false, because local execution of tasks has been removed.
  • The yarn-client and yarn-cluster master URLs are deprecated. Use --master yarn with --deploy-mode client or --deploy-mode cluster instead.
  • Instead of HiveContext, use SparkSession.builder.enableHiveSupport.
  • Instead of SQLContext, use SparkSession.builder (see the sketch after this list).
  • Some methods related to Accumulators, ShuffleWriteMetrics, SparkILoop, Dataset, and SQLContext are now deprecated. You will see warnings in your application logs if you use them.
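
The following Scala sketch (hypothetical application name) shows the SparkSession replacement for the deprecated SQLContext and HiveContext entry points; enableHiveSupport requires the spark-hive dependency on the classpath.

```scala
// Sketch of replacing SQLContext / HiveContext with SparkSession.
import org.apache.spark.sql.SparkSession

object SessionSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.6.1 (deprecated in 2.0.1):
    //   val sqlContext  = new org.apache.spark.sql.SQLContext(sc)
    //   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // Spark 2.0.1:
    val spark = SparkSession.builder()
      .appName("session-sketch")
      .enableHiveSupport() // only when Hive features are needed
      .getOrCreate()

    spark.sql("SHOW TABLES").show() // SparkSession exposes sql(), read, catalog, etc.

    spark.stop()
  }
}
```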