It’s been almost two months since we introduced Stratio Sparkta at Strata London 2015, showing a demo for real-time insights on Twitter hashtags (slides available here).
Since then we have added some new features to our real-time aggregation engine, which is based on Spark Streaming, but we have focused mainly on stabilizing the project and laying the groundwork for an upcoming web tool.
In particular, we have been working hard to improve the syntax of the aggregation policy, which has been completely revised. Since you don’t need to code anything in Spark Streaming when using Stratio Sparkta (cool, right?), the declarative definition of aggregation policies is quite important to us.
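To make the idea of a declarative aggregation policy a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration: the `policy` dictionary, its keys (`dimensions`, `granularity_seconds`, `operator`) and the `aggregate` function are hypothetical names made up for this example, not Stratio Sparkta's actual policy JSON or API, and nothing here runs on Spark Streaming. The point is simply that the user describes what to aggregate, and a generic engine does the rest.

```python
from collections import Counter

# Hypothetical policy for illustration only.
# This is NOT Sparkta's real policy schema.
policy = {
    "dimensions": ["hashtag"],     # fields to group events by
    "granularity_seconds": 60,     # size of each time bucket
    "operator": "count",           # what to compute per group
}


def bucket_start(timestamp, granularity_seconds):
    """Round a Unix timestamp down to the start of its time bucket."""
    return timestamp - (timestamp % granularity_seconds)


def aggregate(events, policy):
    """Count events per (time bucket, dimension values), as the policy describes."""
    if policy["operator"] != "count":
        raise ValueError("this sketch only supports the 'count' operator")
    counts = Counter()
    for event in events:
        bucket = bucket_start(event["timestamp"], policy["granularity_seconds"])
        key = (bucket,) + tuple(event[d] for d in policy["dimensions"])
        counts[key] += 1
    return dict(counts)


# Made-up sample events (timestamps in seconds).
events = [
    {"timestamp": 0, "hashtag": "#spark"},
    {"timestamp": 30, "hashtag": "#spark"},
    {"timestamp": 45, "hashtag": "#strata"},
    {"timestamp": 61, "hashtag": "#spark"},
]
result = aggregate(events, policy)
```

Changing what gets aggregated means editing the `policy` data, not the `aggregate` code, which is the property a declarative definition is meant to give you.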
For now we have not invested much time in developing new inputs or outputs, and have instead focused on improving what we already have. You can receive events from Twitter, Flume, MQTT, Kafka and just about any other Spark Streaming receiver, and write aggregations to MongoDB, Apache Cassandra, Elasticsearch and HDFS + Parquet.
In summary, these are the most interesting improvements in this version:
- Added compatibility with Apache Spark 1.4.x and released a specific distribution for Apache Spark 1.3.x.
- Parquet and CSV outputs.
- Improved Twitter input: tweets can now be filtered by hashtag.
- Fixed an important bug in the Elasticsearch output. Fields are now mapped according to their type.
- Policy refactor. Improved semantics of the policy JSON.
- Support for fragment composition in policies.
Here you can download the binary files:
- Stratio Sparkta 0.5.0 – Apache Spark 1.4.x [zip | tar.gz]
- Stratio Sparkta 0.5.0 – Apache Spark 1.3.x [zip | tar.gz]
If you want a full-featured sandbox to try out Stratio Sparkta, just type “vagrant init stratio/sparkta” (check out the guide).
Our next challenges are already underway:
- Benchmark: Official data throughput figures and the ability to run the benchmark yourself.
- Web policy management and creation wizard, with the ability to run and monitor the policy tool.
- Operational Intelligence: Ability to set alarms based on the flow of events or aggregates (perhaps bringing in some of the benefits of Complex Event Processing).
- Query Component: Using Calcite and Spark to provide an API for querying the aggregated data.
Again, we want to know what you think and how you are using Stratio Sparkta. Our GitHub repository is wide open because Stratio Sparkta is completely open source (Apache 2.0).