Loading Data from MapR Database as an Apache Spark DataFrame
Scala

To load data from a MapR Database JSON table into an Apache Spark DataFrame in Scala, invoke the loadFromMapRDB method on the SparkSession object:
def loadFromMapRDB[T](tableName: String, schema: StructType): DataFrame
import com.mapr.db.spark.sql._

val df = sparkSession.loadFromMapRDB("/tmp/user_profiles")
Java

To load data in a Java application, invoke the loadFromMapRDB method on a MapRDBJavaSession object:
def loadFromMapRDB(tableName: String, schema: StructType, sampleSize: Double): DataFrame
import com.mapr.db.spark.sql.api.java.MapRDBJavaSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

MapRDBJavaSession maprSession = new MapRDBJavaSession(spark);
Dataset<Row> df = maprSession.loadFromMapRDB("/tmp/user_profiles");
Python

To load data in a Python application, invoke the loadFromMapRDB method on the SparkSession object:
loadFromMapRDB(table_name, schema, sample_size)
from pyspark.sql import SparkSession
df = spark.loadFromMapRDB("/tmp/user_profiles")
The only required parameter for each of these calls is tableName. Both DataFrames and MapR Database tables work with structured data. DataFrames need a fixed schema,
whereas MapR Database allows for a flexible schema. When loading data into a DataFrame, you can map
your data to a schema by specifying the schema parameter in the loadFromMapRDB call. You can also provide an application class as the type parameter [T] in the call, as shown in the sketch below. These two approaches are the preferred methods for loading data into DataFrames.
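For example, with a hypothetical case class whose fields mirror the documents in the table, the Scala call might look like this (the class and its fields are assumptions for illustration):

import com.mapr.db.spark.sql._

// Hypothetical application class; the fields are assumptions about the documents in the table
case class UserProfile(_id: String, first_name: String, last_name: String, age: Int)

val df = sparkSession.loadFromMapRDB[UserProfile]("/tmp/user_profiles")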
For data exploration use cases, you might not know the schema of your MapR Database table. For those situations, the MapR Database OJAI Connector for Apache Spark can infer the schema by sampling data from the table.
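As a minimal sketch, loading without a schema and printing the result shows what the connector inferred from its sample:

import com.mapr.db.spark.sql._

// No schema supplied: the connector samples documents in the table to infer one
val df = sparkSession.loadFromMapRDB("/tmp/user_profiles")
df.printSchema()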
Whenever possible, the MapR Database OJAI Connector for Apache Spark pushes projections and filters down to MapR Database for better performance. This allows MapR Database to project and filter data before returning it to your client application.
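For example, in a query like the following sketch, the filter on age and the projection of the two name columns can be applied by MapR Database before rows reach Spark (the column names are assumptions for illustration):

import com.mapr.db.spark.sql._

val df = sparkSession.loadFromMapRDB("/tmp/user_profiles")

// The filter and the column projection can be pushed down to MapR Database,
// so only matching rows and the selected columns are returned to Spark.
val adults = df.filter(df("age") > 18).select("first_name", "last_name")
adults.show()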
The following subtopics describe these techniques.