Spark 2.0.1-1611 API Changes

This topic describes the public API changes that occurred between Apache Spark 1.6.1 and Spark 2.0.1.

Removed Methods

The following items have been removed from Apache Spark 2.0.1:
  • Bagel (the Spark implementation of Google Pregel)
  • Most of the deprecated methods from Spark 1.x, including the items in the following table (a short DataFrame write migration sketch follows this list):
| Category | Subcategory | Instead of this removed API... | Use... |
|----------|-------------|--------------------------------|--------|
| GraphX | | mapReduceTriplets | aggregateMessages |
| GraphX | | runSVDPlusPlus | run |
| GraphX | | GraphKryoRegistrator | |
| SQL | DataType | DataType.fromCaseClassString | DataType.fromJson |
| SQL | DecimalType | DecimalType() | DecimalType(precision, scale) to provide precision explicitly |
| SQL | DecimalType | DecimalType(Option[PrecisionInfo]) | DecimalType(precision, scale) |
| SQL | DecimalType | PrecisionInfo | DecimalType(precision, scale) |
| SQL | DecimalType | precisionInfo | precision and scale |
| SQL | DecimalType | Unlimited | (No longer supported) |
| SQL | Column | Column.in() | isin() |
| SQL | DataFrame | toSchemaRDD | toDF |
| SQL | DataFrame | createJDBCTable | write.jdbc() |
| SQL | DataFrame | saveAsParquetFile | write.parquet() |
| SQL | DataFrame | saveAsTable | write.saveAsTable() |
| SQL | DataFrame | save | write.save() |
| SQL | DataFrame | insertInto | write.mode(SaveMode.Append).saveAsTable() |
| SQL | DataFrameReader | DataFrameReader.load(path) | option("path", path).load() |
| SQL | Functions | cumeDist | cume_dist |
| SQL | Functions | denseRank | dense_rank |
| SQL | Functions | percentRank | percent_rank |
| SQL | Functions | rowNumber | row_number |
| SQL | Functions | inputFileName | input_file_name |
| SQL | Functions | isNaN | isnan |
| SQL | Functions | sparkPartitionId | spark_partition_id |
| SQL | Functions | callUDF | udf |
| Core | SparkContext | Constructors no longer take the preferredNodeLocationData parameter | |
| Core | SparkContext | tachyonFolderName | externalBlockStoreFolderName |
| Core | SparkContext | initLocalProperties, clearFiles, clearJars | (No longer needed) |
| Core | SparkContext | The runJob method no longer takes the allowLocal parameter | |
| Core | SparkContext | defaultMinSplits | defaultMinPartitions |
| Core | SparkContext | [Double, Int, Long, Float]AccumulatorParam | implicit objects from AccumulatorParam |
| Core | SparkContext | rddTo[Pair, Async, Sequence, Ordered]RDDFunctions | implicit functions from RDD |
| Core | SparkContext | [double, numeric]RDDToDoubleRDDFunctions | implicit functions from RDD |
| Core | SparkContext | intToIntWritable, longToLongWritable, floatToFloatWritable, doubleToDoubleWritable, boolToBoolWritable, bytesToBytesWritable, stringToText | implicit functions from WritableFactory |
| Core | SparkContext | [int, long, double, float, boolean, bytes, string, writable]WritableConverter | implicit functions from WritableConverter |
| Core | TaskContext | runningLocally | isRunningLocally |
| Core | TaskContext | addOnCompleteCallback | addTaskCompletionListener |
| Core | TaskContext | attemptId | attemptNumber |
| Core | JavaRDDLike | splits | partitions |
| Core | JavaRDDLike | toArray | collect |
| Core | JavaSparkContext | defaultMinSplits | defaultMinPartitions |
| Core | JavaSparkContext | clearJars, clearFiles | (No longer needed) |
| Core | PairRDDFunctions | PairRDDFunctions.reduceByKeyToDriver | reduceByKeyLocally |
| Core | RDD | mapPartitionsWithContext | TaskContext.get |
| Core | RDD | mapPartitionsWithSplit | mapPartitionsWithIndex |
| Core | RDD | mapWith | mapPartitionsWithIndex |
| Core | RDD | flatMapWith | mapPartitionsWithIndex and flatMap |
| Core | RDD | foreachWith | mapPartitionsWithIndex and foreach |
| Core | RDD | filterWith | mapPartitionsWithIndex and filter |
| Core | RDD | toArray | collect |
| Core | TaskInfo | TaskInfo.attempt | TaskInfo.attemptNumber |
| Guava Optional | | Guava Optional | org.apache.spark.api.java.Optional |
| Vector | | Vector, VectorSuite | |
| Configuration options and params | | --name | |
| Configuration options and params | | --driver-memory | spark.driver.memory |
| Configuration options and params | | --driver-cores | spark.driver.cores |
| Configuration options and params | | --executor-memory | spark.executor.memory |
| Configuration options and params | | --executor-cores | spark.executor.cores |
| Configuration options and params | | --queue | spark.yarn.queue |
| Configuration options and params | | --files | spark.yarn.dist.files |
| Configuration options and params | | --archives | spark.yarn.dist.archives |
| Configuration options and params | | --addJars | spark.yarn.dist.jars |
| Configuration options and params | | --py-files | spark.submit.pyFiles |
Note also the following additional removals and changes:
  • Python DataFrame methods that returned an RDD have been moved to dataframe.rdd. For example, df.map is now df.rdd.map.
  • Some streaming connectors (Twitter, Akka, MQTT, and ZeroMQ) have been removed.
  • org.apache.spark.shuffle.hash.HashShuffleManager no longer exists. SortShuffleManager has been the default since Spark 1.2.
  • DataFrame is no longer a separate class. It is now a type alias for Dataset[Row].
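
For the removed DataFrame write methods listed in the table above (saveAsParquetFile, insertInto, and related calls), the following is a minimal Scala sketch of the migration to the DataFrameWriter API. The application name, input path, output path, and table name are hypothetical placeholders.

```scala
// Minimal migration sketch; paths, table name, and app name are placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriterMigrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("writer-migration-sketch")
      .master("local[*]") // for local testing only; omit when using spark-submit
      .getOrCreate()

    val df = spark.read.json("/tmp/events.json") // hypothetical input path

    // Spark 1.6.1 (removed): df.saveAsParquetFile("/tmp/events.parquet")
    df.write.parquet("/tmp/events.parquet")

    // Spark 1.6.1 (removed): df.insertInto("events")
    df.write.mode(SaveMode.Append).saveAsTable("events")

    spark.stop()
  }
}
```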

Behavior Changes

Spark 2.0.1 implements the following behavior changes:

  • Spark 2.0.1 uses Scala 2.11 instead of 2.10.
  • Floating-point literals in SQL are now parsed as decimal type instead of double type (see the example after this list).
  • The Kryo version is now 3.0.
  • The Jersey version is now 2.
  • The Java RDD flatMap and mapPartitions methods now require functions that return a Java Iterator instead of an Iterable.
  • The Java RDD countByKey and countApproxDistinctByKey methods now return Map[K, Long] instead of Map[K, Object].
  • When writing Parquet files, summary files are no longer written by default (set parquet.enable.summary-metadata to true to re-enable them).
  • MLlib changed substantially. Follow the Apache Spark Migration Guide for details.
  • SparkContext.emptyRDD now returns an RDD instead of an EmptyRDD.
  • The Spark Standalone Master no longer serves job history.
  • org.apache.spark.api.java.JavaPairRDD methods were changed:
    • countByKey and countApproxDistinctByKey now return java.lang.Long instead of scala.Long.
    • sampleByKey and sampleByKeyExact now return java.lang.Double instead of scala.Double.
  • The old application history format, which created a folder for each application, has been removed.
  • org.apache.spark.Logging is now private. Use SLF4J directly instead.
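
As a quick illustration of the literal-parsing change noted in the list above, the following Scala sketch (hypothetical application name; local master used only for illustration) prints the schema Spark infers for an unqualified floating-point literal and shows an explicit cast that restores the previous double behavior.

```scala
// Sketch of the SQL literal-parsing change; app name and master are placeholders.
import org.apache.spark.sql.SparkSession

object LiteralParsingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("literal-parsing-sketch")
      .master("local[*]") // for local testing only
      .getOrCreate()

    // Spark 1.6.1 parsed 1.5 as double; Spark 2.0.1 parses it as a decimal.
    spark.sql("SELECT 1.5 AS x").printSchema()

    // Cast explicitly to keep the previous double behavior.
    spark.sql("SELECT CAST(1.5 AS DOUBLE) AS x").printSchema()

    spark.stop()
  }
}
```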

Other Deprecated Items

  • Java 7 is now deprecated.
  • Python 2.6 is now deprecated.
  • TaskContext.isRunningLocally now always returns false, because local execution of tasks has been removed.
  • The yarn-client and yarn-cluster master URLs are deprecated. Use --master yarn with --deploy-mode client or --deploy-mode cluster instead.
  • Instead of HiveContext, use SparkSession.builder.enableHiveSupport.
  • Instead of SQLContext, use SparkSession.builder (see the sketch after this list).
  • Some methods related to Accumulators, ShuffleWriteMetrics, SparkILoop, Dataset, and SQLContext are now deprecated. You will see warnings in your application logs if you use them.
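
The following Scala sketch (hypothetical application name) shows the SparkSession replacement for the deprecated SQLContext and HiveContext entry points; enableHiveSupport requires the spark-hive dependency on the classpath.

```scala
// Sketch of replacing SQLContext / HiveContext with SparkSession.
import org.apache.spark.sql.SparkSession

object SessionSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.6.1 (deprecated in 2.0.1):
    //   val sqlContext  = new org.apache.spark.sql.SQLContext(sc)
    //   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // Spark 2.0.1:
    val spark = SparkSession.builder()
      .appName("session-sketch")
      .enableHiveSupport() // only when Hive features are needed
      .getOrCreate()

    spark.sql("SHOW TABLES").show() // SparkSession exposes sql(), read, catalog, etc.

    spark.stop()
  }
}
```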