Spark and Kerberos: a safe story

A follow-up talk to this post will be given at Spark Summit East in Boston in February. Find out more.

***

Amongst all the Big Data technology madness, security often seems to be an afterthought at best. When people talk about Big Data technologies and security, they are usually referring to the integration of these technologies with Kerberos. It’s true, however, that this trend seems to be changing for the better, and we now have a few more security options for these technologies, such as TLS. Against this backdrop, we would like to take a look at the interaction between the most popular large-scale data processing technology, Apache Spark, and the most popular authentication framework, MIT’s Kerberos.
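By way of illustration, here is a minimal sketch of that interaction: authenticating against Kerberos from a keytab before creating the SparkContext, using Hadoop’s UserGroupInformation API. The principal and keytab path are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

// Log in to Kerberos from a keytab so that subsequent accesses to a
// kerberized HDFS are authenticated. Principal and path are placeholders.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab(
  "spark-user@EXAMPLE.COM",
  "/etc/security/keytabs/spark-user.keytab")

val sc = new SparkContext(new SparkConf().setAppName("kerberized-app"))
```

In YARN deployments, the same effect is usually achieved by passing --principal and --keytab to spark-submit, which also takes care of renewing credentials for long-running jobs.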

Creating a Recommender System (Part II)

After the resounding success of the first article on recommender systems, Alvaro Santos is back with some further insight into creating a recommender system.

 

Coming soon: A follow-up Meetup in Madrid to go even further into this exciting topic. Stay tuned!

***

In the previous article of this series, we explained what a recommender system is, describing its main parts and providing some basic algorithms which are frequently used in these systems. We also explained how to code some functions to read JSON files and to map the data in MongoDB and ElasticSearch using Spark SQL and Spark connectors.

This second part will cover:

  • Generating our Collaborative Filtering model (sketched just after this list).
  • Pre-calculating product / user recommendations.
  • Launching a small REST server to interact with the recommender.
  • Querying the data store to retrieve content-based recommendations.
  • Mixing the different types of recommendations to create a hybrid recommender.
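As a taste of the first two items above, here is a minimal sketch using Spark MLlib’s ALS implementation; the toy ratings and hyperparameters are placeholders, not the values used in the article.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy ratings: Rating(userId, productId, score). In the real system these
// come from the data loaded in part I. `sc` is an existing SparkContext.
val ratings = sc.parallelize(Seq(
  Rating(1, 101, 5.0), Rating(1, 102, 3.0),
  Rating(2, 101, 4.0), Rating(2, 103, 1.0)))

// Train the Collaborative Filtering model (matrix factorization via ALS).
val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

// Pre-calculate the top 5 product recommendations for every user.
val userRecs = ratings.map(_.user).distinct().collect().map { userId =>
  userId -> model.recommendProducts(userId, 5)
}
```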

Continuous Delivery in depth #2

The not so lean side

Remember issue #1, published in the summer? We are back with the next part in the series, wearing the hat of Pitfall Harry to look at some of the issues we have come across and how they have affected our day-to-day work. We also include some tips for overcoming them.

First things first: Jenkins’ pipelines are an awesome improvement over basic Jenkins functionality, allowing us to easily build complex continuous delivery flows with a high degree of reusability and maintainability.

Having said this, pipelines are code. And code is written by human beings. Human beings make mistakes. Such errors are reflected as software defects and execution failures.

This post will take a look at some of the defects, pitfalls and limitations of our (amazing) Jenkins’ pipelines, defining some possible workarounds.

Stratio Crossdata vs Presto

Introduction

Nowadays, there are a lot of Big Data query engines available. Some companies struggle to choose which one to use. Benchmarks exist, but results can be contradictory and thus difficult to trust.

One Big Data query engine that is frequently mentioned is Presto. We wanted to find out more about its potential and decided to compare it with Crossdata, a data hub that extends the capabilities of Apache Spark, in a controlled environment. We found that the most popular persistence layers in our projects are Apache Cassandra, MongoDB and HDFS+Parquet; since MongoDB is not supported by Presto, the benchmark was carried out with Apache Cassandra and HDFS+Parquet only.

Crossdata provides additional features and optimizations on top of Spark’s SQLContext through its XDContext. It can be deployed as a library within Apache Spark or using a client-server architecture in which the cluster of servers forms a P2P structure.
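A minimal sketch of the library deployment mode, assuming XDContext is instantiated like a standard SQLContext; the package name and the "products" table are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.crossdata.XDContext // assumed package name

// XDContext wraps the SparkContext much like a SQLContext does, adding
// Crossdata's extra features and optimizations on top of Spark SQL.
val sc = new SparkContext(new SparkConf().setAppName("crossdata-demo"))
val xdContext = new XDContext(sc)

val result = xdContext.sql("SELECT id, price FROM products WHERE price > 100")
result.show()
```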

The Developer’s Guide to Scala Implicit Values (Part II)

Imagine a rectangular grid of cells, in which each cell has a value: either black (dead) or white (alive). And imagine that:

  1. Any live cell with two or three live neighbors survives to the next generation.
  2. Any live cell with four or more live neighbors dies from overpopulation.
  3. Any live cell with one or no live neighbors dies from isolation.
  4. Any dead cell with exactly three live neighbors comes to life.

These are the four simple rules of Conway’s Game of Life. You could hardly imagine a simpler set of rules to code on your computer, and you wouldn’t expect any interesting result at all, but…

Behold the wonders of its hidden might!
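To see how little code those four rules actually need, here is a minimal sketch in Scala; representing the board as the set of live cells is our own choice for brevity.

```scala
// The board is simply the set of live cells.
type Cell = (Int, Int)

def neighbors(c: Cell): Seq[Cell] =
  for {
    dx <- -1 to 1
    dy <- -1 to 1
    if (dx, dy) != (0, 0)
  } yield (c._1 + dx, c._2 + dy)

// One generation: a cell is alive in the next step if it has exactly three
// live neighbors, or if it is alive now and has exactly two live neighbors.
// This condition is equivalent to the four rules above.
def step(alive: Set[Cell]): Set[Cell] = {
  val candidates = alive ++ alive.flatMap(neighbors)
  candidates.filter { c =>
    val n = neighbors(c).count(alive)
    n == 3 || (n == 2 && alive(c))
  }
}

// A "blinker" oscillates between a horizontal and a vertical bar of three cells.
val blinker = Set((0, 1), (1, 1), (2, 1))
println(step(blinker)) // the vertical bar: cells (1,0), (1,1), (1,2)
```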

Continuous delivery in depth #1

Following on from a previous “Lunch & Learn” about how Jenkins is used for Stratio’s Continuous Delivery jobs (you can watch it on Stratio’s YouTube channel), it seemed logical to provide more information on how we use the Jenkins Pipeline plugin.

In this first issue, we will follow how pipelines are used at Stratio Big Data to achieve full lifecycle traceability, from the development team all the way to the final production environment.

Some pitfalls were mentioned during the “Lunch & Learn” meeting; these will be explained in a second issue, to help you fully understand the nature of each underlying bug and the solution we reached.

How to Aggregate Data in Real-Time with Stratio Sparta

When working with Big Data, you frequently need to aggregate data in real time, whether it comes from a specific service, such as a social network (Twitter, Facebook…), or from more diverse sources, like a weather station. A good way to process these large amounts of information is Spark Streaming, which delivers all the data in real time, but it has one problem: you have to program it yourself.
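For a flavour of what that hand-written Spark Streaming code looks like, here is a minimal sketch; the socket source and the "key value" line format are assumptions for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Aggregate a stream of "key value" lines (e.g. "madrid 21.5") over a
// sliding window: sum per key across the last 60 seconds, every 5 seconds.
val conf = new SparkConf().setAppName("realtime-agg").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val events = ssc.socketTextStream("localhost", 9999)
val sums = events
  .map(_.split(" "))
  .map(fields => (fields(0), fields(1).toDouble))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(5))

sums.print()
ssc.start()
ssc.awaitTermination()
```

This is exactly the kind of boilerplate that Sparta aims to spare you from writing by hand.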

Variance in Scala (“Luke, he is your father too”)

When working with Big Data, sometimes it’s useful to remember that powerful products wouldn’t work properly without the tools they are built with. It’s possible to start programming in Scala with a few case classes and a bunch of for-comprehensions, but those are just small scratches on the surface of the huge iceberg that Scala is, and they may not be enough to make your code clean and comprehensible. I’ve been developing in this programming language for almost 4 years, and every day I discover a new feature that surprises me. That, in the end, is the main reason to keep digging deeper into Scala.

Stratio’s Lucene-based index for Cassandra is now a plugin

Thanks to the changes proposed in CASSANDRA-8717, CASSANDRA-7575 and CASSANDRA-6480, Stratio is glad to present its Lucene-based implementation of Cassandra secondary indexes as a plugin that can be attached to the Apache distribution. Before these changes, the Lucene index was distributed inside a fork of Apache Cassandra, with all the difficulties that implied, i.e. maintaining a fork. As of now, the fork is discontinued and new users should use the recently created plugin, which retains all the features of Stratio Cassandra.

Stratio’s Lucene index extends Cassandra’s functionality to provide near real-time distributed search engine capabilities, such as those of ElasticSearch or Solr, including full text search and free multivariable search…

MongoDB – Spark Connector Whitepaper

We recently worked with MongoDB and their developer team on an analysis of their Hadoop-based connector versus our native connector solution. The paper highlights how Stratio’s connector for Apache Spark implements the PrunedFilteredScan API instead of the TableScan API, which effectively allows you to avoid scanning the entire collection.
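For context, here is a hypothetical skeleton of what a PrunedFilteredScan relation looks like in the Spark SQL data sources API; the class and its internals are illustrative, not the actual connector code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// With PrunedFilteredScan, Spark SQL hands the data source the columns and
// filters of each query, so only the relevant part of the collection is read.
// A plain TableScan relation would have to return every row of every column.
class MongoRelation(override val sqlContext: SQLContext,
                    collectionSchema: StructType)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = collectionSchema

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // A real connector would translate `filters` into a MongoDB query and
    // `requiredColumns` into a projection before building the RDD.
    ???
  }
}
```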

Our connector supports the Spark Catalyst optimizer for both rule-based and cost-based query optimization.