Driving Digital Transformation through Big Data

A Stratio Success Story

“Stratio DataCentric came into existence because of a technological gap that exists in the world today,” says Nacho Navarro, Stratio

What is Stratio? This is a question that we can really only answer now, three years after our foundation by a team of seasoned engineers in 2013. Why has it taken us so long? Because we have been busy pulling together the most transformational and disruptive tool ever to exist in the short history of Big Data. We started with a vision and have made it a reality.

Javier Cortejoso, Gaspar Muñoz and Nacho Navarro reminisce about the journey towards the creation of Stratio’s powerful, state-of-the-art tool: Stratio DataCentric.

Stratio Crossdata vs Presto

Introduction

Nowadays, there are a lot of Big Data query engines available. Some companies struggle to choose which one to use. Benchmarks exist, but results can be contradictory and thus difficult to trust.

One Big Data query engine that is frequently mentioned is Presto. We wanted to find out more about its potential and decided to compare it with Crossdata in a controlled environment, given that Crossdata is a data hub that extends the capabilities of Apache Spark. We detected that the most popular persistence layers in our projects are Apache Cassandra, MongoDB and HDFS+Parquet, but MongoDB is not supported by Presto. The benchmark was therefore carried out with Apache Cassandra and HDFS+Parquet only.

Crossdata provides additional features and optimizations to Spark’s SQLContext through the XDContext. It can be deployed as an Apache Spark library, or using a Client-Server architecture in which the cluster of servers forms a peer-to-peer (P2P) structure.
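As a rough sketch of the library deployment mode, usage could look like the following (the package name, constructor and query below are illustrative assumptions, not taken from Crossdata’s documentation):

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Assumed package name, for illustration only.
import org.apache.spark.sql.crossdata.XDContext

object CrossdataDemo {
  // A plain SQL query; Crossdata resolves it against registered data sources.
  val query = "SELECT id, name FROM customers LIMIT 10"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("crossdata-demo").setMaster("local[*]"))

    // XDContext is used here as a drop-in extension of SQLContext,
    // so existing SQLContext code keeps working.
    val xdContext = new XDContext(sc)
    xdContext.sql(query).show()

    sc.stop()
  }
}
```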

Creating a Recommender System (Part I)


This two-article series explains how to design and implement a hybrid recommender system, similar to the ones used by Amazon or eBay.

Introduction

Let’s start with a short definition from Wikipedia:

Recommender systems or recommendation systems (sometimes replacing “system” with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item.

The following diagram is a basic illustration:

Recommender System diagram

A recommender system analyses input data containing information on different products and their user ratings. After reading and processing the data, the system creates a model that can be used to predict ratings for a particular product or user.
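As a toy illustration of that prediction step (pure Scala, with invented names, and deliberately naive: it predicts a user's rating for an item as the mean of the ratings other users gave that item):

```scala
object RatingPrediction {
  // ratings: (userId, itemId) -> rating given
  type Ratings = Map[(String, String), Double]

  // Naive baseline: predict a user's rating for an item as the mean
  // rating other users gave that item; None if nobody has rated it.
  def predict(ratings: Ratings, user: String, item: String): Option[Double] = {
    val others = ratings.collect {
      case ((u, i), r) if i == item && u != user => r
    }
    if (others.isEmpty) None else Some(others.sum / others.size)
  }
}
```

A real recommender replaces this average with a learned model, but the input/output shape (known ratings in, predicted rating out) is the same.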

Approaches

In the recommender system world, there are three types of approaches to filter products: collaborative filtering, content-based filtering, and hybrid approaches that combine the two.

The Developer’s Guide to Scala Implicit Values (Part II)

Imagine a rectangular grid of cells, in which each cell has a value: either black (dead) or white (alive). And imagine that:

  1. Any live cell with two or three live neighbors survives to the next generation.
  2. Any live cell with four or more live neighbors dies from overpopulation.
  3. Any live cell with one or no live neighbors dies from isolation.
  4. Any dead cell with exactly three live neighbors comes to life.


These are the four simple rules of Conway’s Game of Life. You could hardly imagine a simpler set of rules to code on your computer, and you wouldn’t expect any interesting result at all, but…
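The four rules really do fit in a few lines. Here is a minimal, unoptimized Scala sketch that represents the board as a set of live cell coordinates:

```scala
object GameOfLife {
  type Cell = (Int, Int)

  // The eight neighboring coordinates of a cell.
  def neighbors(c: Cell): Set[Cell] = {
    val (x, y) = c
    (for {
      dx <- -1 to 1
      dy <- -1 to 1
      if (dx, dy) != (0, 0)
    } yield (x + dx, y + dy)).toSet
  }

  // One generation: rule 1 keeps live cells with 2 or 3 live neighbors
  // (rules 2 and 3 kill the rest); rule 4 brings dead cells with exactly
  // three live neighbors to life.
  def step(alive: Set[Cell]): Set[Cell] = {
    val survivors = alive.filter { c =>
      val n = neighbors(c).count(alive)
      n == 2 || n == 3
    }
    val births = alive.flatMap(neighbors).diff(alive).filter { c =>
      neighbors(c).count(alive) == 3
    }
    survivors ++ births
  }
}
```

Starting from a horizontal line of three live cells, successive calls to `step` produce the famous "blinker" oscillator.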

Behold the wonders of its hidden might!

Stratio @ #MesosCon Europe

MesosCon Europe was held in Amsterdam from August 31 to September 2, and a small delegation of Stratio’s crew was there.

Benjamin Hindman’s opening keynote

Mesosphere’s Co-Founder & Chief Architect Benjamin Hindman broke the ice with the first keynote.

Alberto Rodriguez and Andrés Macarrilla at MesosCon Europe

After talking about the Mesos ecosystem’s growth over the last few months, he explained the nested containerization model and the improvements in Mesos resource allocation. He then introduced praekelt.org, an African nonprofit organization dedicated to using mobile technology to improve the lives of people living in poverty. A representative from the NGO explained how Mesos and DC/OS are a perfect fit for its cluster provisioning: the NGO has to run quite a few clusters in which 80% of the setup is identical across clusters and the remaining 20% differs, so it gets the most out of Mesos and DC/OS by deploying those distinctive parts separately. Asked what Mesos’ biggest deficiency is nowadays, the representative replied that they were struggling to find a persistence layer that fits their current needs (he pointed out that they are currently using GlusterFS as their persistence backend).

Continuous delivery in depth #1

Following on from a previous “Lunch & Learn” session about how Jenkins is used for Stratio’s Continuous Delivery jobs (available on Stratio’s YouTube channel), it seemed logical to provide more information on our usage of the Jenkins Pipeline plug-in.

In this first issue, we will follow how pipelines are used at Stratio Big Data to achieve full lifecycle traceability, from the development team to a final production environment.

Some pitfalls were mentioned during the “Lunch & Learn” meeting; a second issue will explain them, to help you fully understand the nature of the underlying bug and the solution we arrived at.

The Developer’s Guide to Scala Implicit Values (Part I)

Implicit parameters and conversions are powerful Scala tools, increasingly used to develop concise, versatile DSLs, APIs and libraries.

When used correctly, they reduce the verbosity of Scala programs, thus providing easy-to-read code. But their most important feature is that they offer a way to make your library’s functionality extensible without having to change its code or redistribute it.
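A classic instance of that extensibility is the so-called "pimp my library" pattern: an implicit class adds methods to a type you don't own, without touching or redistributing its source (the `vowelCount` method below is just an invented example):

```scala
object StringExtensions {
  // Adds a vowelCount method to String without modifying String itself.
  implicit class RichString(s: String) {
    // Counting against a Set[Char] keeps the predicate simple.
    def vowelCount: Int = s.count("aeiouAEIOU".toSet)
  }
}

// Wherever the implicit class is in scope, the new method is available
// on every String, as if it had always been there:
import StringExtensions._
// "Stratio".vowelCount yields 3
```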

With great power comes great responsibility, however. For newcomers, as well as for relatively experienced Scala users, implicits can become a galaxy of confusion and pitfalls, because using implicit values implies the compiler making decisions that are not obviously described in the code, following a set of rules with some unexpected results.

This post aims to shed some light on the use of implicit values. Its content isn’t 100% original; it is just a tourist guide through this code jungle, full of marvels and, at times, dangers.
As with most of the monstrous things that make us shiver, implicit values are mostly harmless once you get to know them.

Benchmarking Machine learning prediction models

When surfing the internet, it is quite easy to find sites comparing the most popular Machine Learning toolkits (datascience.stackexchange.com, oreilly.com or udacity.com). These sites give you a lot of information about the strengths and weaknesses of the libraries, how they work, and some examples to compare how easy these tools are to use. If you are new to the business, they are therefore very helpful for finding the right library to begin studying your data. They are, after all, written by Data Scientists for Data Scientists.

However, as a Software Engineer, you would rather know whether these tools are going to work well or just crash your servers. Based on this premise, the main objective of this article is to explore some Machine Learning libraries and see how they behave in a real-time, semi-production scenario.

Using Spark SQLContext, HiveContext & Spark Dataframes API with ElasticSearch, MongoDB & Cassandra

In this post we will show how to use the different SQL contexts for querying data with Spark.
We will begin with Spark SQLContext and follow up with HiveContext. In addition, we will run queries against various NoSQL databases and analyze the advantages and disadvantages of using each one, so without further ado, let’s get started!

First of all, we need to create a Spark context whose configuration includes the options for connecting to Cassandra:
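A minimal sketch of that context creation, assuming the spark-cassandra-connector is on the classpath (the host and port values are placeholders for your own cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CassandraContextDemo {
  // Property names come from the spark-cassandra-connector;
  // the values are placeholders for your own cluster.
  val cassandraOptions = Map(
    "spark.cassandra.connection.host" -> "127.0.0.1",
    "spark.cassandra.connection.port" -> "9042"
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-cassandra-sql")
      .setMaster("local[*]")
      .setAll(cassandraOptions)

    val sc = new SparkContext(conf)
    // sqlContext is the entry point for the SQL queries in the rest
    // of the post.
    val sqlContext = new SQLContext(sc)
  }
}
```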

Spark SQLContext allows us to connect to different data sources to read or write data, but it has limitations: when the program ends or the Spark shell is closed, all the links to the data sources we have created are lost, as they are temporary and will not be available in the next session.

How to Aggregate Data in Real-Time with Stratio Sparta

When working with Big Data, you frequently need to aggregate data in real time, whether it comes from a specific service, such as a social network (Twitter, Facebook…), or from more diverse sources, like a weather station. A good way to process these large amounts of information is Spark Streaming: it delivers the data in real time, but with one catch: you have to program everything yourself.
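Stripped of any framework, "aggregating data in real time" boils down to folding each incoming batch of events into a per-key running state; in Spark Streaming, `updateStateByKey` performs this kind of stateful fold for you. A minimal pure-Scala sketch of the idea (not Sparta or Spark code):

```scala
object RunningAggregate {
  // Fold one batch of (key, value) events into the per-key state:
  // conceptually what a streaming engine does for every batch it receives.
  def update(state: Map[String, Long], batch: Seq[(String, Long)]): Map[String, Long] =
    batch.foldLeft(state) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
```

Feeding successive batches through `update` yields the running aggregate; a tool like Sparta lets you declare this kind of aggregation (plus windowing and persistence) instead of hand-coding it.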