Optimizing HPE Ezmeral Data Fabric Database Lookups in Spark Jobs

The lookupFromMapRDB() API uses the primary and secondary indexes on an HPE Ezmeral Data Fabric Database table to optimize table lookups and outputs the results to an Apache Spark DataFrame.

IMPORTANT: The lookupFromMapRDB() API functionality requires a patch. The patch works with EEP 6.2.0 (Core 6.1.0, Spark 2.4.0.0) and EEP 6.3.0 (Core 6.1.0, Spark 2.4.4.0). To install patches, see Applying a Patch.

The loadFromMapRDB() API in MapR Database Connectors for Apache Spark is optimized to load massive amounts of data from HPE Ezmeral Data Fabric Database tables with high throughput. In cases where a Spark job needs to look up a small number of documents based on an equality (or short-range) condition on a primary or secondary key, use the lookupFromMapRDB() API instead.
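The two APIs are invoked the same way; they differ in the access pattern each is optimized for. A minimal Scala sketch, assuming a table at the hypothetical path /tbl and a Spark shell where the spark session is already defined:

import com.mapr.db.spark.sql._

// Full-table load: optimized for scanning large volumes of documents.
val fullDF = spark.loadFromMapRDB("/tbl")

// Indexed lookup: optimized for fetching a few documents by primary or secondary key.
val lookupDF = spark.lookupFromMapRDB("/tbl")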

Invoke the lookupFromMapRDB() API when the filter conditions in short-range and equality queries reference primary or secondary keys. If the filter condition references any non-primary keys (fields other than the _id field), a secondary index must exist on those keys. Indexes on the filtering keys are essential to achieving reasonable performance of lookup queries on HPE Ezmeral Data Fabric Database tables.

The lookupFromMapRDB() API uses the secondary keys in indexes to look up values in the primary table. For example, if a query contains the filter conditions mydate = '2012-03-26' and myid = '120026015', a secondary index (of type composite) created on the mydate and myid fields must exist for the query to quickly output results.
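Such a composite index can be created with the maprcli table index add command. A sketch, assuming the table path /tbl and a hypothetical index name mydate_myid_idx:

maprcli table index add -path /tbl -index mydate_myid_idx -indexedfields mydate,myid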

The following examples demonstrate how to invoke the lookupFromMapRDB() API in Scala, Java, and Python to perform a lookup in an HPE Ezmeral Data Fabric Database table and output the results to an Apache Spark DataFrame:
Scala

import com.mapr.db.spark.sql._
import spark.implicits._

val df = spark.lookupFromMapRDB("/tbl")
df.filter($"mydate" === "2012-03-26" && $"myid" === "120026015").show()
Java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.mapr.db.spark.sql.api.java.MapRDBJavaSession;

SparkSession sparkSession = SparkSession.builder().getOrCreate();
MapRDBJavaSession mapRDBJavaSession = new MapRDBJavaSession(sparkSession);
Dataset<Row> df2 = mapRDBJavaSession.lookupFromMapRDB("/tbl");
df2.filter("mydate = '2012-03-26' and myid = '120026015'").show();
Python

from pyspark.sql import SparkSession

# Assumes the connector's Python bindings are on the path, for example when
# running through pyspark or spark-submit on the cluster.
spark = SparkSession.builder.getOrCreate()

df = spark.lookupFromMapRDB("/tbl")
df.filter("mydate = '2012-03-26' and myid = '120026015'").show()
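In all three examples, the equality filters on mydate and myid allow the lookup to be served by the composite index described above; without an index on those fields, the lookup cannot achieve reasonable performance.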