Profiling and segmentation: A graph database clustering solution


This post is about an exciting journey that starts with a problem and ends with a solution. One of the top banks in Europe came to us with a request: they needed a better profiling system.

We came up with a methodology for clustering nodes in a graph database according to concrete parameters.

We started by developing a Proof of Concept (POC) to test an approximation of the bank’s profiling data, using the following technologies:

  • Java / Scala as programming languages.
  • Apache Spark to handle the given data sets.
  • Neo4j as the graph database.

The POC began as a two-month project in which we quickly arrived at a powerful solution to the bank’s needs.

We decided to use Neo4j, along with Cypher, its query language, because relationships are a core aspect of the bank’s data model: a graph database can manage highly connected data and complex queries. We were then able to build node clusters with GraphX, Apache Spark’s API for graphs and graph-parallel computation.
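As a rough illustration of that last step, here is a minimal GraphX sketch, assuming customer nodes and their relationships have already been exported from Neo4j into Spark RDDs; the vertex labels, edge weights and the use of connected components as the clustering criterion are illustrative choices, not the bank’s actual model or parameters.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object ProfileClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("profile-clustering").setMaster("local[*]"))

    // Illustrative vertices: (id, profile label). In the POC these would be
    // loaded from Neo4j (for example via a Cypher export), not hard-coded.
    val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "profile-A"), (2L, "profile-B"), (3L, "profile-C"), (4L, "profile-D")
    ))

    // Illustrative edges: relationships between profiles, weighted by similarity.
    val edges: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(1L, 2L, 0.9), Edge(2L, 3L, 0.7)
    ))

    val graph = Graph(vertices, edges)

    // Connected components labels every vertex with the lowest vertex id in its
    // component, which serves here as a simple cluster identifier.
    val clusters = graph.connectedComponents().vertices

    clusters.join(vertices).collect().foreach { case (_, (clusterId, label)) =>
      println(s"$label -> cluster $clusterId")
    }

    sc.stop()
  }
}
```

In practice the clustering criterion (connected components, label propagation, or a custom Pregel-style computation) would be chosen according to the concrete parameters agreed with the bank.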

Monitoring the Spanish 2015 General Elections

We’re just a couple of days away from the Spanish general elections and Twitter is boiling over with campaign-related messages. People want to have a say in what goes on in their country, and they turn to Twitter to express their opinions and feelings.

Social networks are starting to play a very important role in political events in Spain, which is why candidates from different parties are actively seeking to get the most out of their presence on these platforms. They apply different strategies that allow them to connect with people and, hopefully, win their votes.

At Stratio we have been monitoring the campaign with our real-time data aggregation system, Stratio Sparkta, and with our visualization tool, Stratio Viewer. We use Apache Spark to process the data and MongoDB to store it.
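Sparkta and Viewer handle this end to end, but as a rough sketch of the kind of pipeline underneath, the snippet below counts campaign hashtags over a sliding window with Spark Streaming; the TwitterUtils receiver, the window sizes and the stand-in output step are assumptions for illustration, not Sparkta’s internals.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object ElectionHashtagsSketch {
  def main(args: Array[String]): Unit = {
    // Twitter credentials are expected as twitter4j system properties
    // (twitter4j.oauth.consumerKey, etc.); they are omitted here.
    val conf = new SparkConf().setAppName("election-hashtags").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val tweets = TwitterUtils.createStream(ssc, None)

    // Extract hashtags and count them over a 10-minute sliding window.
    val hashtagCounts = tweets
      .flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(600))

    hashtagCounts.foreachRDD { rdd =>
      // In the real pipeline each aggregated batch would be written to MongoDB
      // and picked up by Stratio Viewer; printing the top hashtags stands in
      // for that step here.
      rdd.map(_.swap).top(10).foreach { case (count, tag) => println(s"$tag: $count") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```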

MongoDB – Spark Connector Whitepaper

We recently worked with MongoDB and their developer team on an analysis of their Hadoop-based connector versus our native connector. The paper highlights how Stratio’s connector for Apache Spark implements the PrunedFilteredScan API instead of the TableScan API, which allows it to avoid scanning the entire collection.

Our connector supports the Spark Catalyst optimizer for both rule-based and cost-based query optimization.
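For context, this is roughly what the PrunedFilteredScan contract looks like in Spark SQL’s data sources API. The relation class below is a bare-bones, hypothetical sketch: a TableScan relation only exposes buildScan(): RDD[Row] and must read every document, whereas PrunedFilteredScan receives the required columns and the filters pushed down by Catalyst, so a data source can translate them into a MongoDB query and projection instead of scanning the whole collection. The actual connector’s types and query translation are not shown here.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation illustrating the PrunedFilteredScan contract.
class MongoRelationSketch(collectionName: String, override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  // A real relation would infer or declare the schema of the collection.
  override def schema: StructType = StructType(Nil)

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real connector would turn `filters` into a MongoDB query document and
    // `requiredColumns` into a projection before reading `collectionName`,
    // returning only the matching fields and documents.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```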

Stratio Sparkta 0.5.0 release

It’s been almost two months since we introduced Stratio Sparkta at Strata London 2015, where we showed a demo of real-time insights on Twitter hashtags (slides available here).

During this time we have added some new features to the real-time aggregation engine based on Spark Streaming, but we have focused especially on stabilizing the project and laying the groundwork for an upcoming web tool.

In particular, we have been working hard to improve the syntax of the aggregation policy, which has been completely revised. Since you don’t need to code anything in Spark Streaming when using Stratio Sparkta (cool, right?), the declarative definition of aggregation policies is quite important to us.

When Stratio Met Spark: A True Love Story

Certified distribution

Stratio is delighted to announce that it is officially a Certified Spark Distribution. The certification is very important to us because we deeply believe that the certification program provides many benefits to the Spark community: it facilitates collaboration and integration, offers broad evolution and support for the rich Spark ecosystem, simplifies the adoption of critical security updates, and allows the development of applications valid for any certified distribution, a key ingredient for a successful ecosystem.

First Apache Spark meetup

Stratio kicked off its first Apache Spark meetups earlier this month. Our friends and members of other groups on Meetup joined in when we announced the group in Madrid.

Only two weeks later we celebrated our first Spark meetup, and expectations were unusually high: we counted 101 members in the group, 50 RSVPs and 8 members on the waiting list. This surprised us because Spark only recently became an Apache project and is not yet widely known among mainstream enterprises in the US, never mind in Europe.