Data Lake: A More Technical Point of View


Companies have recently come to realize that the real value of their business lies in their data. There has been a rush to build huge Data Lakes to store the enormous amounts of data available inside each company. The concept of a Data Lake is that of a low-cost but highly scalable infrastructure in which all types of data can be stored.

This sounds good, but creating a Data Lake is not easy and a good design is a must.

A whopping 1K releases using Jenkins!


We don’t usually like to boast, but on this one we can’t hold back. As of 17 February 2017, a huge (if largely symbolic) milestone was reached: more than 1,000 automated releases performed by our Jenkins installation across our continuous delivery pipelines.

Profiling and segmentation: A graph database clustering solution


This post is about an exciting journey that starts with a problem and ends with a solution. One of the top banks in Europe came to us with a request: they needed a better profiling system.

We came up with a methodology for clustering nodes in a graph database according to concrete parameters.

We started by developing a Proof of Concept (POC) against an approximation of the bank’s profiling data, using the following technologies:

  • Java / Scala, as programming languages.
  • Apache Spark, to handle the given data sets.
  • Neo4j, as the graph database.

The POC began as a two-month project in which we rapidly arrived at a powerful solution to the bank’s needs.

We decided to use Neo4j, along with its query language, Cypher, because relationships are a core aspect of the bank’s data model and graph databases handle highly connected data and complex queries well. We were then able to build node clusters thanks to GraphX, an Apache Spark API for running graph and graph-parallel compute operations on data.
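
The original post doesn’t include code, but a rough sketch of the GraphX side of that approach could look something like this. The sample nodes, edges, and the use of connected components as the cluster label are illustrative assumptions on our part, not the bank’s actual model:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    object ProfileClustering {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("profile-clustering").setMaster("local[*]"))

        // Toy profile nodes; in the real POC these would come from Neo4j.
        val vertices = sc.parallelize(Seq[(VertexId, String)](
          (1L, "customer-A"), (2L, "customer-B"), (3L, "account-X"),
          (4L, "customer-C"), (5L, "account-Y")))

        // Relationships between nodes (names are purely illustrative).
        val edges = sc.parallelize(Seq(
          Edge(1L, 3L, "owns"), Edge(2L, 3L, "owns"), Edge(4L, 5L, "owns")))

        val graph = Graph(vertices, edges)

        // Connected components assigns each node the smallest vertex id in its
        // component, which can serve as a simple cluster label for segmentation.
        val clusters = graph.connectedComponents().vertices
        clusters.join(vertices).collect().foreach {
          case (_, (clusterId, name)) => println(s"$name -> cluster $clusterId")
        }

        sc.stop()
      }
    }

In the real solution the graph would be loaded from Neo4j rather than hard-coded, and the clustering criteria would follow the bank’s concrete parameters.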

Ideas from Big Data Spain 2016

By Sondos Atwi @Sondos_4

On the 17th and 18th of November, I attended the Big Data Spain conference. It was my first time attending this type of event, and it was an excellent opportunity to meet experts in the field and attend high-quality talks. So I decided to write this post to share a few of the presented slides and ideas.

PS: Please excuse the quality of some slides/pictures; they were all taken with my phone camera 🙂

First, congrats to Big Data Spain on being the second biggest Big Data conference in Europe, right after O’Reilly Strata. This year’s edition was also around 50% bigger than last year’s!


Now let’s dig into the details…

How to Aggregate Data in Real-Time with Stratio Sparta

When working with Big Data, you frequently need to aggregate data in real time, whether it comes from a specific service, such as social networks (Twitter, Facebook…), or from more diverse sources, like a weather station. A good way to process these large amounts of information is Spark Streaming, which provides all the data in real time, but it has one problem: you have to program it yourself.
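
To give an idea of the sort of hand-written code Sparta saves you from, here is a minimal Spark Streaming sketch that aggregates hashtag counts over a sliding window. The socket source, host, port, and window sizes are arbitrary assumptions for the example:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RealTimeAggregation {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread for the receiver, one for processing.
        val conf = new SparkConf().setAppName("realtime-aggregation").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Hypothetical text source; in practice this could be a Twitter stream
        // or readings from a weather station.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Count hashtags over a sliding 60-second window, updated every 5 seconds.
        val counts = lines.flatMap(_.split(" "))
          .filter(_.startsWith("#"))
          .map(tag => (tag, 1))
          .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(5))

        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }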

Monitoring the Spanish 2015 General Elections

We’re just a couple of days away from the Spanish general elections, and Twitter is boiling up with campaign-related messages. People want to have a say in what goes on in their country, and they turn to Twitter to express their opinions and feelings.


Social networks are starting to play a very important role in political events in Spain, which is why candidates from different parties are actively seeking to get the most out of their presence on these platforms. They apply different strategies that allow them to connect with the people and, hopefully, gain their votes.


At Stratio we have been monitoring the campaign with our real-time data aggregation system, Stratio Sparkta, and with our visualization tool, Stratio Viewer. We use Apache Spark to process the data and MongoDB to store it.

Huawei Appoints Stratio as Technology Partner

Proud to share the press release announcing Stratio as Huawei’s technological partner and looking forward to working together.

AMSTERDAM, Nov. 5, 2015 /PRNewswire/ — Huawei announced that Stratio has officially been certified as a Huawei Solution Partner (Technology) for Enterprise Data Centre Solutions.

Stratio, which pioneered the first Big Data platform using Apache Spark and integrating main NoSQL and SQL distributed databases, becomes Huawei’s first Big Data Technology Partner. The Stratio platform reduces complexity compared to other platforms, by giving customers control over all their Big Data software, and reduces Big Data time-to-value tenfold.

MongoDB – Spark Connector Whitepaper

We recently worked with MongoDB and their developer team on an analysis of their Hadoop-based connector versus our native connector solution. The paper highlights how Stratio’s connector for Apache Spark implements the PrunedFilteredScan API instead of the TableScan API, which effectively allows you to avoid scanning the entire collection.

Our connector supports the Spark Catalyst optimizer for both rule-based and cost-based query optimization.
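
As a rough illustration of the pushdown idea described in the paper, here is our own simplified sketch of a Spark SQL relation implementing PrunedFilteredScan. This is not the connector’s actual code, and the MongoDB access is stubbed out:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, GreaterThan, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    // Simplified relation: only the Spark data-source contract is real here,
    // the MongoDB access itself is stubbed out.
    class MongoPushdownRelation(
        override val sqlContext: SQLContext,
        override val schema: StructType) extends BaseRelation with PrunedFilteredScan {

      // Spark hands us only the columns and filters the query needs, so they can
      // be translated into a native MongoDB query instead of scanning the whole
      // collection (a plain TableScan only exposes buildScan() with no arguments).
      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        val mongoQuery = filters.collect {
          case EqualTo(attr, value)     => s"""{"$attr": "$value"}"""
          case GreaterThan(attr, value) => s"""{"$attr": {"$$gt": "$value"}}"""
        }.mkString(", ")
        val projection = requiredColumns.mkString(", ")
        queryMongo(mongoQuery, projection) // stand-in for the real cursor logic
      }

      private def queryMongo(query: String, projection: String): RDD[Row] =
        sqlContext.sparkContext.emptyRDD[Row] // placeholder
    }

The key point is that a TableScan relation receives no column or filter information, so the whole collection must be read before Spark can filter it, whereas PrunedFilteredScan lets the data source do that work natively.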

100 Stratians and counting

When we first started using Spark, we were twenty people. Twenty Stratians. We took a risk and adopted Spark very early on; with a lot of teamwork (and a lot of mistakes along the way), we managed to create the first pure Spark platform.

We started getting more projects, and without realizing it 20 turned into 40, 40 into 60, and 60 into 100 Stratians. And we haven’t stopped growing ever since.

Supporting service-based multi realm authentication and authorization

Security is often a forgotten concern in Big Data environments. However, as these technologies are embraced by companies with sensitive data (think, for example, of banks or insurance companies), security is a growing requirement. At Stratio, we are aware of our clients’ needs, so we are exploring the development of an integrated security solution for our platform.