MongoDB – Spark Connector Whitepaper

We recently worked with MongoDB and their developer team to compare their Hadoop-based connector with our native connector. The paper highlights how Stratio’s connector for Apache Spark implements the PrunedFilteredScan API instead of the TableScan API. With PrunedFilteredScan, Spark passes the required columns and filters to the connector, so the query can be pushed down to MongoDB and you avoid scanning the entire collection.
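To make the difference between the two scan styles concrete, here is a minimal sketch in plain Python. It is not the connector's code and does not use Spark or MongoDB: the in-memory `COLLECTION`, the `(field, op, value)` filter tuples, and the function names `table_scan` and `pruned_filtered_scan` are all hypothetical, chosen only to illustrate the idea.

```python
import operator
from typing import Any, Dict, Iterator, List, Tuple

Document = Dict[str, Any]
Filter = Tuple[str, str, Any]  # hypothetical (field, operator, value) form

# Hypothetical stand-in for a MongoDB collection.
COLLECTION: List[Document] = [
    {"_id": 1, "name": "Ada", "age": 36, "city": "London"},
    {"_id": 2, "name": "Linus", "age": 28, "city": "Helsinki"},
    {"_id": 3, "name": "Grace", "age": 45, "city": "New York"},
]

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def table_scan(collection: List[Document]) -> Iterator[Document]:
    """TableScan style: hand back every document with every field.

    Any filtering or column selection has to happen afterwards.
    """
    for doc in collection:
        yield dict(doc)


def pruned_filtered_scan(
    collection: List[Document],
    required_columns: List[str],
    filters: List[Filter],
) -> Iterator[Document]:
    """PrunedFilteredScan style: the source receives the columns and
    filters up front, so it only returns matching rows and needed fields.
    """
    for doc in collection:
        keep = all(
            field in doc and OPS[op](doc[field], value)
            for field, op, value in filters
        )
        if keep:
            yield {col: doc.get(col) for col in required_columns}


if __name__ == "__main__":
    # Everything comes back, whether the query needs it or not.
    print(list(table_scan(COLLECTION)))
    # Only the "name" field of documents with age > 30 comes back.
    print(list(pruned_filtered_scan(COLLECTION, ["name"], [("age", ">", 30)])))
```

In a real data source the loop would not run on the client: the columns and filters would be translated into a projection and a query that the database evaluates itself, which is where the saving comes from.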

Our connector supports the Spark Catalyst optimizer for both rule-based and cost-based query optimization. To operate against multi-structured data, the connector infers the schema by sampling documents from the MongoDB collection. This process is controlled by the samplingRatio parameter. If the schema is known, the developer can provide it to the connector, avoiding the need for any inference. Once data is stored in MongoDB, Stratio provides an ODBC/JDBC connector for integrating results with any BI tool.
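As a rough illustration of schema inference by sampling, the sketch below reads a fraction of the documents and merges the field types it sees. It is a simplified assumption of how such inference can work, not the connector's implementation: `infer_schema`, `merge_types`, the `seed` argument, and the widening rules (int and float become float, any other conflict becomes str) are all hypothetical. Only the idea of a sampling ratio comes from the text above.

```python
import random
from typing import Any, Dict, List, Optional

Document = Dict[str, Any]

# Hypothetical documents with differing structure.
DOCS: List[Document] = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 28.5, "city": "Helsinki"},
    {"name": "Grace", "age": None},
]


def merge_types(current: Optional[str], new: str) -> str:
    """Combine two observed type names into one (assumed widening rules)."""
    if current is None or current == new:
        return new
    if {current, new} == {"int", "float"}:
        return "float"
    return "str"  # fall back to string on any other conflict


def infer_schema(
    documents: List[Document],
    sampling_ratio: float = 1.0,
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Infer a field -> type-name mapping from a random sample of documents."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    rng = random.Random(seed)
    schema: Dict[str, str] = {}
    for doc in documents:
        # Keep each document with probability sampling_ratio.
        if rng.random() >= sampling_ratio:
            continue
        for field, value in doc.items():
            if value is None:
                continue  # a null value tells us nothing about the type
            schema[field] = merge_types(schema.get(field), type(value).__name__)
    return schema


if __name__ == "__main__":
    print(infer_schema(DOCS))                      # full scan of the sample data
    print(infer_schema(DOCS, sampling_ratio=0.5))  # cheaper, may miss fields
```

The trade-off is visible even at this scale: a lower ratio reads fewer documents but can miss fields that appear only in some of them, which is why supplying a known schema avoids the problem altogether.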

The connector can be downloaded from the community Spark Packages repository.

Installation is simple – the connector can be included in a Spark application with a single command.

One of the main advantages of implementing Spark's DataFrame API is that you can integrate different data sources. For example, you could join a MongoDB collection with an Elasticsearch index.

Many thanks to Mat Keep and Sam Weaver from MongoDB, and to our team of devs, for carrying out the analysis.

Download the whitepaper here.
