Continuous Delivery in depth #3

This will be the last installment in the “Continuous Delivery in depth” series. After the good and the bad, here comes the ugly. Ugly because of the amount of change required: a pull request with 308 commits was merged, adding 2932 lines whilst removing a whopping 10112. This represented roughly a 75% reduction in lines of code, which obviously improves maintainability.

To go further on the topic, on 27 April 2017 you will have the opportunity to join the first JAM in Madrid: confirm your attendance!

Data Lake: A More Technical Point of View


Companies have lately come to realize that the real value of their business lies in data. There has been a rush to create huge Data Lakes to store the enormous amounts of data available inside each company. The concept of a Data Lake is that of a low-cost but highly scalable infrastructure in which all types of data can be stored.

This sounds good, but creating a Data Lake is not easy and a good design is a must.

A whopping 1K releases using Jenkins!


We don’t usually like to boast but on this one we can’t hold back. As of 17 February 2017, a huge (though purely symbolic) milestone was reached: more than 1000 automated releases performed by our Jenkins installation across our continuous delivery pipelines.

Freeze!! Continuous Delivery not working!


This is the first part of a story, a story about how important it is to have a reliable release and deployment process.

Anyone working in our sector has had to deal with deployments. I’d like to launch this series with a bunch of interesting questions that will let you know whether you should change your deployment process.

Profiling and segmentation: A graph database clustering solution


This post is about an exciting journey that starts with a problem and ends with a solution. One of the top banks in Europe came to us with a request: they needed a better profiling system.

We came up with a methodology for clustering nodes in a graph database according to specific parameters.

We started by developing a Proof of Concept (POC) to test an approximation of the bank’s profiling data, using the following technologies:

  • Java / Scala, as programming languages.
  • Apache Spark, to handle the given data sets.
  • Neo4j, as the graph database.

The POC began as a 2-month long project in which we rapidly discovered a powerful solution to the bank’s needs.

We decided to use Neo4j, along with Cypher, Neo4j’s query language, because relationships are a core aspect of the bank’s data model: graph databases excel at managing highly connected data and complex queries. We were then able to build node clusters thanks to GraphX, Apache Spark’s API for graphs and graph-parallel computation.
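The post itself doesn’t include code, but as a rough sketch of the idea, here is how one might group nodes into clusters with GraphX using connected components. The vertex names, relationship types, and the choice of connected components are illustrative assumptions, not the bank’s actual data or the methodology described in the post.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.graphx.{Edge, Graph, VertexId}

  object ClusteringSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("graph-clustering").setMaster("local[*]"))

      // Hypothetical customer nodes and relationships; in the real project these
      // would be loaded from Neo4j rather than hard-coded.
      val vertices = sc.parallelize(Seq[(VertexId, String)](
        (1L, "customer-A"), (2L, "customer-B"), (3L, "customer-C"), (4L, "customer-D")))
      val edges = sc.parallelize(Seq(
        Edge(1L, 2L, "TRANSACTS_WITH"),
        Edge(3L, 4L, "SHARES_ACCOUNT")))

      val graph = Graph(vertices, edges)

      // Connected components labels each vertex with the smallest vertex id in its
      // component, which we read here as a (very naive) cluster label.
      val clusters = graph.connectedComponents().vertices

      clusters.join(vertices).collect().foreach { case (_, (clusterId, name)) =>
        println(s"$name -> cluster $clusterId")
      }

      sc.stop()
    }
  }

In practice the clustering criteria would be driven by the profiling parameters mentioned above rather than by plain connectivity; this snippet only shows how GraphX fits into the picture.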

Employee turnover: the good, the bad, and the ugly


It is a common truism in Human Resources that labor turnover is generally bad news for a given company and that management must take precautionary measures to reduce it, or at least to keep it under control.

When there is market demand for the services performed by an employee who leaves a company, the latter finds itself with unforeseen expenses to qualify, source, hire, and “onboard” a suitable replacement.

Aside from the out-of-pocket expense, which is directly reflected in the books, the wound is usually deeper than that. There are hidden costs linked to the loss of business in the area the former employee was presumably contributing to, both during the period in which the position is open and during the time it takes the new hire to settle in and reach the peak productivity of their predecessor. These hidden costs also stem from the loss of knowledge, which can leave a gap that affects other employees in the department, hinders cooperation, and so forth. These indirect costs are hard to quantify properly, but they typically dwarf the direct accounting costs that are immediately felt in the cash flow statements. Estimates of the impact found in the literature range from 25% to 200% of the departing employee’s annual compensation, with the figure probably being industry or even business dependent.

Spark and Kerberos: a safe story

A follow-up talk to this post will be given at Spark Summit East in Boston in February. Find out more.

***

Amongst all the Big Data technology madness, security seems to be an afterthought at best. When one talks about Big Data technologies and security, they are usually referring to the integration of these technologies with Kerberos. That said, this trend seems to be changing for the better, and we now have a few security options for these technologies, such as TLS. Against this backdrop, we would like to take a look at the interaction between the most popular large-scale data processing technology, Apache Spark, and the most popular authentication framework, MIT’s Kerberos.
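The full post goes into the details; as a rough illustration (not taken from the post), a Spark driver running against a kerberized Hadoop cluster typically authenticates from a keytab before touching HDFS. The principal, keytab path, and data path below are placeholders invented for the example.

  import org.apache.hadoop.security.UserGroupInformation
  import org.apache.spark.sql.SparkSession

  object KerberizedSparkJob {
    def main(args: Array[String]): Unit = {
      // Obtain a Kerberos ticket from a keytab so the job can talk to a
      // kerberized HDFS/YARN. Principal and keytab path are made up.
      UserGroupInformation.loginUserFromKeytab(
        "svc_analytics@EXAMPLE.COM", "/etc/security/keytabs/svc_analytics.keytab")

      val spark = SparkSession.builder().appName("kerberized-job").getOrCreate()

      // Once authenticated, reads against the secured cluster work as usual.
      val lines = spark.read.textFile("hdfs:///data/events/2017/02/")
      println(s"Lines read: ${lines.count()}")

      spark.stop()
    }
  }

For jobs submitted to YARN, passing the --principal and --keytab options to spark-submit is a common alternative to logging in programmatically, so Spark can renew delegation tokens for long-running applications.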

Creating a Recommender System (Part II)

After the resounding success of the first article on recommender systems, Alvaro Santos is back with some further insight into creating a recommender system.

 

Coming soon: A follow-up Meetup in Madrid to go even further into this exciting topic. Stay tuned!

***

In the previous article of this series, we explained what a recommender system is, describing its main parts and presenting some basic algorithms that are frequently used in these systems. We also explained how to code some functions to read JSON files and to load the data into MongoDB and Elasticsearch using Spark SQL and the Spark connectors.
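For readers who missed part one, a rough re-creation of that loading step might look like the sketch below (shown only for the Elasticsearch side, using the elasticsearch-spark library); the path, field names, and index name are invented for illustration and are not the article’s actual code.

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession
  import org.elasticsearch.spark.sql._   // adds saveToEs() to DataFrames

  object LoadRatings {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("load-ratings")
        .set("es.nodes", "localhost")    // Elasticsearch host; placeholder value
        .set("es.port", "9200")
      val spark = SparkSession.builder().config(conf).getOrCreate()

      // Read the raw JSON dump into a DataFrame via Spark SQL.
      val ratings = spark.read.json("hdfs:///recommender/ratings.json")
        .select("userId", "productId", "rating")

      // Index the documents into Elasticsearch so they can later serve
      // content-based queries; "recommender/ratings" is a placeholder index/type.
      ratings.saveToEs("recommender/ratings")

      spark.stop()
    }
  }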

This second part will cover:

  • Generating our Collaborative Filtering model (a brief sketch follows this list).
  • Pre-calculating product / user recommendations.
  • Launching a small REST server to interact with the recommender.
  • Querying the data store to retrieve content-based recommendations.
  • Mixing the different types of recommendations to create a hybrid recommender.
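Regarding that first item, here is a minimal, illustrative sketch of training a collaborative filtering model with Spark MLlib’s ALS and pre-calculating recommendations; the input path, data layout, and hyperparameters are placeholders, not the values used in the article.

  import org.apache.spark.mllib.recommendation.{ALS, Rating}
  import org.apache.spark.sql.SparkSession

  object TrainCfModel {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("train-cf-model").getOrCreate()
      val sc = spark.sparkContext

      // Load (user, product, rating) triples; the CSV path and layout are made up.
      val ratings = sc.textFile("hdfs:///recommender/ratings.csv").map { line =>
        val Array(user, product, rating) = line.split(",")
        Rating(user.toInt, product.toInt, rating.toDouble)
      }

      // Train the collaborative filtering model
      // (rank = 10, iterations = 10, lambda = 0.01 are placeholder hyperparameters).
      val model = ALS.train(ratings, 10, 10, 0.01)

      // Pre-calculate the top 5 product recommendations for every user,
      // ready to be served later by the REST layer.
      val topRecs = model.recommendProductsForUsers(5)
      topRecs.saveAsObjectFile("hdfs:///recommender/precomputed-recs")

      spark.stop()
    }
  }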

Ideas from Big Data Spain 2016

By Sondos Atwi @Sondos_4

On the 17th and 18th of November, I attended the Big Data Spain conference. It was my first time attending this type of event, and it was an excellent opportunity to meet experts in the field and attend high-quality talks. So I decided to write this post to share a few of the presented slides and ideas.

PS: Please excuse the quality of some slides/pictures, they were all taken with my phone camera 🙂

First, congrats to Big Data Spain on being the second biggest Big Data conference in Europe, right after O’Reilly Strata. This year’s edition also saw an increase of around 50% over last year’s!

 

Now let’s dig into the details…

Continuous Delivery in depth #2

The not so lean side

Remember issue #1, published in the summer? We are back with the next part in the series, wearing the hat of Pitfall Harry to look at some of the issues we have come across and how they have impacted our day-to-day work. We also include some tips for overcoming them.

First things first: Jenkins pipelines are an awesome improvement over basic Jenkins functionality, allowing us to easily build complex continuous delivery flows with a high degree of reusability and maintainability.

Having said this, pipelines are code. And code is written by human beings. Human beings make mistakes. Such errors are reflected as software defects and execution failures.

This post will take a look at some of the defects, pitfalls and limitations of our (amazing) Jenkins pipelines, and suggest some possible workarounds.