One File System to ingest them all (and in Kafka bind them)

Sounds epic, doesn’t it? Actually, it’s not that epic!

It could be interesting (or very geeky) to talk about how to ingest data in Middle-earth (and what for). However, I guess that would be out of the scope of this blog, so I’m afraid this post has nothing to do with that. This post is about how to ingest data from different kinds of file systems by means of Kafka Connect, using a connector I’ve forged recently.
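
To give a flavour of how such a connector gets wired up, here is a minimal Kafka Connect standalone configuration. Only name, connector.class and tasks.max are standard Connect settings; the class name, topic and file-system URI below are illustrative placeholders, not the connector’s actual options.

    # Illustrative connector config; class name and fs settings are hypothetical
    name=fs-source-example
    connector.class=com.example.connect.FsSourceConnector
    tasks.max=1
    # where to read files from and which topic to publish the records to
    fs.uris=file:///data/landing
    topic=ingested-files

You would pass a file like this to connect-standalone.sh alongside the worker properties, and Kafka Connect would take care of polling the file system and publishing the records.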

SparkArt! Part One: Exploring the depths of the Mandelbrot Set with Spark

On 26 March 2012, James Cameron and his submersible, Deepsea Challenger, explored the depths of the ocean down to 11 km below sea level at 11.329903°N 142.199305°E, an infinitesimal point on the surface of the Earth’s vast oceans. Can you imagine how incredible it would be to have thousands of “Deepsea Challengers” reaching the bottom of our planet in parallel? What a map we would get!
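
The parallel-probes metaphor maps quite naturally onto Spark. As a rough sketch (assuming a SparkContext named sc, e.g. in spark-shell; the resolution and bounds are arbitrary), each sampled point of the complex plane is probed by its own task using the classic escape-time iteration:

    // Classic escape-time iteration: how many steps before the point "escapes".
    val maxIter = 1000
    def escapeTime(cx: Double, cy: Double): Int = {
      var x = 0.0; var y = 0.0; var iter = 0
      while (x * x + y * y <= 4.0 && iter < maxIter) {
        val xNew = x * x - y * y + cx
        y = 2.0 * x * y + cy
        x = xNew
        iter += 1
      }
      iter
    }

    // One record per pixel, mapped onto the region [-2.5, 1.0] x [-1.0, 1.0];
    // every pixel is probed in parallel by the Spark workers.
    val (width, height) = (800, 600)
    val depths = sc.parallelize(0 until width * height).map { i =>
      val (px, py) = (i % width, i / width)
      val cx = -2.5 + 3.5 * px.toDouble / width
      val cy = -1.0 + 2.0 * py.toDouble / height
      ((px, py), escapeTime(cx, cy))
    }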

3 guards, 2 doors…
Solution to the riddle!

Our participation at Spark Summit 2017 in San Francisco was once again a great experience, not only because of the quality of the speakers, but also because of the special atmosphere built up by Spark lovers. To enliven the breaks between talks, we brought a riddle and invited all attendees to solve it.

Evolutionary feature subset selection in big datasets (Part I)

When we want to fit a Machine Learning (ML) model to a big dataset, it is often recommended to carefully pre-process the input data in order to obtain better results. Although it is widely accepted that more data leads to better results, this is not necessarily true when it comes to the number of variables in our data: some variables may be noisy, redundant or simply not useful.
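
To make the idea concrete before getting to the evolutionary part, here is a toy sketch of what searching over feature subsets can look like: a candidate is just a bitmask over the columns, and its fitness would normally be the validation score of a model trained only on the selected columns. The fitness below is a random placeholder and none of the names come from the actual series.

    import scala.util.Random

    val numFeatures = 20
    val rng = new Random(42)

    // Placeholder fitness: in a real run this would train and validate a model
    // restricted to the columns where the mask is true.
    def fitness(mask: Vector[Boolean]): Double = rng.nextDouble()

    // Flip each bit with a small probability to explore nearby subsets.
    def mutate(mask: Vector[Boolean]): Vector[Boolean] =
      mask.map(bit => if (rng.nextDouble() < 0.05) !bit else bit)

    // Keep the best masks of each generation and mutate them to produce the next.
    var population = Vector.fill(30)(Vector.fill(numFeatures)(rng.nextBoolean()))
    for (_ <- 1 to 50) {
      val parents = population.sortBy(m => -fitness(m)).take(10)
      population = parents ++ parents.flatMap(p => Vector(mutate(p), mutate(p)))
    }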

UXSpain 2017: understanding user needs from different perspectives

On 12-13 May 2017, the 6th edition of UX Spain took place. If there were a scale to measure when an event has reached maturity, UX Spain would, after this edition, sit at the top. Once again, an impeccable organization, outstanding speakers with juicy content and two consecutive days as a trending topic on Twitter have turned UX Spain into a reference in the Design and UX sector. Well done to the organizers!

Continuous Delivery in depth #3

This will be the last installment in the “Continuous Delivery in depth” series. After the good and the bad, here comes the ugly. Ugly because of the amount of change required: a pull request with 308 commits was merged, adding 2,932 lines whilst removing a whopping 10,112. That is roughly a 75% reduction in lines of code, which obviously improves maintainability.

To delve further into the topic, on 27 April 2017 you have the opportunity to join the first JAM in Madrid: confirm your attendance!

Data Lake: A More Technical Point of View


Companies have lately come to realize that the real value of their business is their data. There has been a rush to create huge Data Lakes to store the enormous amounts of data available inside each company. The concept of a Data Lake is that of a low-cost but highly scalable infrastructure in which all types of data can be stored.

This sounds good, but creating a Data Lake is not easy and a good design is a must.

A whopping 1K releases using Jenkins!


We don’t usually like to boast, but on this one we can’t hold back. As of 17 February 2017, a huge (but admittedly symbolic) milestone was reached: more than 1,000 automated releases performed by our Jenkins installation across our continuous delivery pipelines.

Freeze!! Continuous Delivery not working!


This is the first part of a story, a story about how important it is to have a reliable release and deployment process.

Anyone working in our sector has had to deal with deployments. I’d like to launch this series with a bunch of interesting questions that will let you know whether you should change your deployment process.

Profiling and segmentation: A graph database clustering solution


This post is about an exciting journey that starts with a problem and ends with a solution. One of the top banks in Europe came to us with a request: they needed a better profiling system.

We came up with a methodology for clustering the nodes of a graph database according to specific parameters.

We started by developing a Proof of Concept (POC) to test our approach against an approximation of the bank’s profiling data, using the following technologies:

  • Java / Scala, as the programming languages.
  • Apache Spark, to handle the given datasets.
  • Neo4j, as the graph database.

The POC began as a two-month project in which we rapidly discovered a powerful solution to the bank’s needs.

We decided to use Neo4j, along with Cypher, Neo4j’s query language, because relationships are a core aspect of the bank’s data model and a graph database can manage highly connected data and complex queries. We were then able to build node clusters thanks to GraphX, Apache Spark’s API for graph and graph-parallel computation.
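
As a rough illustration of what clustering with GraphX can look like, here is a generic connected-components example on toy data; it is not necessarily the exact algorithm or schema used in the project, and it assumes an existing SparkContext sc (e.g. in spark-shell).

    import org.apache.spark.graphx.{Edge, Graph}

    // Toy graph of profile nodes linked by illustrative relationships.
    val vertices = sc.parallelize(Seq(
      (1L, "customer-A"), (2L, "customer-B"), (3L, "customer-C"), (4L, "customer-D")
    ))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "shares-account"), Edge(3L, 4L, "shares-address")
    ))
    val graph = Graph(vertices, edges)

    // connectedComponents labels every vertex with the smallest vertex id
    // reachable from it; vertices sharing a label form one cluster.
    val clusters = graph.connectedComponents().vertices
      .join(vertices)
      .map { case (_, (clusterId, name)) => (clusterId, name) }
      .groupByKey()

    clusters.collect().foreach { case (id, members) =>
      println(s"cluster $id: ${members.mkString(", ")}")
    }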