Using Serialization with the HPE Ezmeral Data Fabric Database OJAI Connector for Apache Spark

In the context of the HPE Ezmeral Data Fabric Database OJAI Connector for Apache Spark, serialization refers to the process of converting objects to and from bytes. This section describes how to configure your application to use a more efficient serializer.

The Apache Spark cluster framework requires serialization to exchange objects between the driver and the cluster executors. This type of serialization has nothing to do with the way HPE Ezmeral Data Fabric Database serializes objects to disk.

Because classes used in Spark transformations or actions must be serializable, classes created for the HPE Ezmeral Data Fabric Database OJAI Connector for Apache Spark are serializable.

Spark uses Java serialization by default, but it can alternatively use Kryo serialization. The connector introduces a new Kryo registrator so that you can avoid the default Java serialization. Kryo serialization provides better performance than Java serialization.

The following example shows how to set the new Kryo registrator in SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .set("spark.kryo.registrator", "com.mapr.db.spark.OJAIKryoRegistrator")

A JSON document can use both complex and primitive value types. Java can serialize the primitive types, but for complex types (such as Map, Array, and Binary), you must use wrappers to achieve serialization. See Working with Complex JSON Document Types for details about these wrappers.
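
As an illustration, the following sketch loads OJAI documents through the connector and extracts a map-typed field. The table path /apps/users and the address field are hypothetical, and sc is an existing SparkContext; the complex value arrives as a serializable wrapper (such as DBMapValue), so it can safely cross executor boundaries:

import com.mapr.db.spark._

// Load documents from a hypothetical Data Fabric Database table.
val users = sc.loadFromMapRDB("/apps/users")

// "address" is assumed to be a map-typed field; it is returned wrapped
// in a serializable type rather than as a plain java.util.Map.
val addresses = users.map(doc => doc.address)
addresses.take(5).foreach(println)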

Time-related data types, such as ODate, OInterval, OTime, and OTimeStamp, use Java serialization by default. For efficiency, new serializers and comparators have been created for these data types.

Here are the new serializers and the types to which they apply:
Serializer                Type
ODateSerializer           ODate
OTimeSerializer           OTime
OTimeStampSerializer      OTimeStamp
OIntervalSerializer       OInterval
DBBinaryValueSerializer   ByteBuffer
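
To see where these serializers come into play, the sketch below shuffles an RDD keyed by ODate. With the Kryo registrator configured as shown earlier, the keys are serialized with ODateSerializer rather than Java serialization; the sample dates are placeholders and sc is an existing SparkContext:

import org.ojai.types.ODate

// Sample ODate keys (placeholder values).
val events = sc.parallelize(Seq(
   (ODate.parse("2015-06-29"), "signup"),
   (ODate.parse("2015-07-01"), "login")
))

// groupByKey forces a shuffle, so each ODate key travels between
// executors through the registered serializer.
events.groupByKey().collect().foreach(println)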