Ideas from Big Data Spain 2016

By Sondos Atwi @Sondos_4

On the 17th and 18th of November, I attended the Big Data Spain conference. It was my first time attending this type of event, and it was an excellent opportunity to meet experts in the field and attend high-quality talks. So I decided to write this post to share a few of the slides and ideas presented.

PS: Please excuse the quality of some slides/pictures; they were all taken with my phone camera 🙂

First, congrats to Big Data Spain on being the second biggest Big Data conference in Europe, right after O’Reilly Strata. This year’s edition also saw an increase of around 50% over last year’s!

 

Now let’s dig into the details…

The Developer’s Guide to Scala Implicit Values (Part I)

Implicit parameters and conversions are powerful Scala features, increasingly used to build concise, versatile tools such as DSLs, APIs and libraries.

When used correctly, they reduce the verbosity of Scala programs, making the code easier to read. But their most important feature is that they offer a way to make your libraries’ functionality extensible without having to change or redistribute their code.

With great power comes great responsibility, however. For newcomers, as well as for relatively experienced Scala users, implicits can become a galaxy of confusion and pitfalls, because using implicit values means the compiler makes decisions that are not obviously described in the code, following a set of rules that sometimes gives unexpected results.

This post aims to shed some light on the use of implicit values. Its content isn’t 100% original; it is just a tourist guide through this code jungle, full of marvels and sometimes dangerous.
Like most of those monstrous things that make us shiver, implicit values are mostly harmless once you get to know them.
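
To make the two mechanisms concrete, here is a minimal sketch in Scala 2 syntax. `Temperature` stands in for a library type we cannot modify; it and every other name here (`TemperatureOps`, `Show`, `describe`, `showTemperature`) is hypothetical and made up for illustration. The implicit class adds a method from the outside, and the implicit parameter lets the compiler supply a `Show` instance that is never mentioned at the call site.

```scala
object ImplicitsSketch {
  // Stand-in for a type owned by some library (hypothetical).
  final case class Temperature(celsius: Double)

  // Implicit conversion, written as an implicit class: wraps a Temperature
  // so that `fahrenheit` can be called on it as if it were a member.
  implicit class TemperatureOps(t: Temperature) {
    def fahrenheit: Double = t.celsius * 9.0 / 5.0 + 32.0
  }

  // A small type class: "things that can be rendered as text".
  trait Show[A] {
    def show(a: A): String
  }

  // Implicit parameter: the caller does not pass `s`; the compiler looks
  // for an implicit Show[A] in scope and fills it in.
  def describe[A](a: A)(implicit s: Show[A]): String = s.show(a)

  // The instance the compiler will pick for Temperature.
  implicit val showTemperature: Show[Temperature] = new Show[Temperature] {
    def show(t: Temperature): String = s"${t.celsius} C"
  }
}
```

With `import ImplicitsSketch._` in scope, `Temperature(100.0).fahrenheit` compiles although `Temperature` defines no such method, and `describe(Temperature(100.0))` finds `showTemperature` without being told to. Both lookups are the kind of decision the compiler makes silently, which is where the confusion tends to start.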

Using Spark SQLContext, HiveContext & Spark Dataframes API with ElasticSearch, MongoDB & Cassandra

In this post we will show how to use the different SQL contexts to query data in Spark.
We will begin with Spark SQL and follow up with HiveContext. In addition to this, we will run queries on various NoSQL databases and analyze the advantages and disadvantages of using them. So without further ado, let’s get started!

First of all, we need to create a context whose Spark configuration includes the options for connecting to Cassandra.

Spark SQLContext allows us to connect to different data sources to read or write data, but it has limitations. Namely, all links to the data sources we have created are temporary: when the program ends or the Spark shell is closed, they are lost and will not be available in the next session.
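
As a rough sketch of what this setup can look like with the Spark 1.x API, the snippet below builds a `SQLContext` with a Cassandra connection option and registers a temporary table. It assumes the DataStax spark-cassandra-connector is on the classpath and a Cassandra node is listening on `127.0.0.1`; the application name, the keyspace `highschool` and the table `students` are hypothetical placeholders, not values from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlContextSketch {
  def main(args: Array[String]): Unit = {
    // Cassandra connection options go into the Spark configuration.
    val conf = new SparkConf()
      .setAppName("sqlcontext-sketch") // hypothetical name
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local node

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read a Cassandra table through the Data Sources API.
    // Keyspace and table names are placeholders.
    val students = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "highschool", "table" -> "students"))
      .load()

    // The temporary table exists only inside this SQLContext.
    students.registerTempTable("students")
    sqlContext.sql("SELECT * FROM students LIMIT 10").show()

    // After stop(), the "students" registration is gone; a new session
    // has to load and register the table again.
    sc.stop()
  }
}
```

The temporary registration is what the limitation above refers to: nothing about `students` is persisted between sessions.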

Huawei Appoints Stratio as Technology Partner

Proud to share the press release announcing Stratio as Huawei’s technological partner and looking forward to working together.

AMSTERDAM, Nov. 5, 2015 /PRNewswire/ — Huawei announced that Stratio has officially been certified as a Huawei Solution Partner (Technology) for Enterprise Data Centre Solutions.

Stratio, which pioneered the first Big Data platform using Apache Spark and integrating the main distributed NoSQL and SQL databases, becomes Huawei’s first Big Data Technology Partner. The Stratio platform reduces complexity compared to other platforms, by giving customers control over all their Big Data software, and reduces Big Data time-to-value tenfold.

100 Stratians and counting

When we first started using Spark, we were twenty people. Twenty Stratians. We took a risk and adopted Spark very early on, but with a lot of teamwork and a lot of mistakes, we managed to create the first pure Spark platform.

We started getting more projects, and without realizing it 20 turned into 40, 40 into 60, and 60 into 100 Stratians. And we haven’t stopped growing ever since.

Top-k queries in Cassandra: An embedded mapreduce approach

Stratio has just added support for top-k queries to its Lucene-based implementation of Cassandra’s secondary indexes. This implementation was originally designed to allow embedded full-text and multivariable search in Apache Cassandra. The previous release included an ad-hoc mechanism to perform distributed relevance queries based on Lucene’s scoring algorithm. The current release generalizes this mechanism to allow several types of top-k queries.
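
The general idea behind a map-reduce style top-k query can be sketched in a few lines of Scala. This is only an illustration of the technique, not Stratio’s implementation: `Hit`, `localTopK`, `mergeTopK` and `topK` are hypothetical names, the nodes are modelled as in-memory sequences, and the scores are assumed to be comparable across nodes.

```scala
object TopKSketch {
  // A scored row as a single node might return it (hypothetical shape).
  final case class Hit(key: String, score: Double)

  // "Map" step: each node keeps only its own k best hits.
  def localTopK(hits: Seq[Hit], k: Int): Seq[Hit] =
    hits.sortBy(h => -h.score).take(k)

  // "Reduce" step: the coordinator merges the per-node lists
  // and keeps the k best hits overall.
  def mergeTopK(perNode: Seq[Seq[Hit]], k: Int): Seq[Hit] =
    perNode.flatten.sortBy(h => -h.score).take(k)

  // Full query: run the map step on every node, then reduce.
  def topK(nodes: Seq[Seq[Hit]], k: Int): Seq[Hit] =
    mergeTopK(nodes.map(localTopK(_, k)), k)
}
```

Because every row in the global top-k must also be in the top-k of the node that holds it, each node only needs to send k rows to the coordinator, however many rows it stores.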

Spark-MongoDB library

Now that the Data Sources API has been released, we wanted to take advantage of its new features, and for this reason we have developed a Spark-MongoDB library. This new connector helps the growing MongoDB community simplify its interaction with this data source via Spark.

This library provides a mechanism for accessing MongoDB collections in a structured way from SparkSQL, accessible from the Python and Scala APIs. Since MongoDB is a leading open-source document database among NoSQL databases and is widely used in many projects [http://www.mongodb.com/leading-nosql-database], we find this connection, with all the operations permitted by SparkSQL, not only useful but necessary.

When Stratio Met Spark: A True Love Story

Certified distribution

Stratio is delighted to announce that it is officially a Certified Spark Distribution. The certification is very important for us because we deeply believe that the certification program provides many benefits to the Spark community: It facilitates collaboration and integration, offers broad evolution and support for the rich Spark ecosystem, simplifies adoption of critical security updates and allows development of applications valid for any certified distribution – a key ingredient for a successful ecosystem.