Stratio is a proud Platinum sponsor of the 3rd edition of Big Data Spain, one of the largest Big Data conferences in Europe.…
We are happy to announce that our talks “META: An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities” and “Apache Cassandra at Telefonica CBS” have been selected for the Cassandra Summit 2014.…
“Data matures like wine, applications like fish.” (James Governor)
Stratio 0.9.1 release
We are proud to announce that the Stratio 0.9.1 release is now available. In May 2014 Stratio reached a major milestone by releasing the first pure Spark, enterprise-ready big data platform. But this was just the beginning, the first step in a long journey. Just a couple of months later a new release is ready, including a large number of new features and improvements. Here are some highlights:
Stratio kicked off its first Apache Spark meetups earlier this month. Our friends and members of other Meetup groups joined in when we announced the group in Madrid.
Only two weeks later we held our first Spark meetup. Interest was unusually high: the group had 101 members, 50 of them RSVP’d and 8 more were on the waiting list. This surprised us because Spark only recently became an Apache project and is yet to be widely known in mainstream enterprise in the US, never mind in Europe.
“The world is one big data problem.” (Andrew McAfee, Center for Digital Business at the MIT Sloan School of Management)
Spark Infographic: Advantages, activity, evolution of Spark adoption and main headlines.
This paper was presented at the EuroSys 2013 conference and is available for download at the conference website. The paper presents BlinkDB, which, despite its name, is not a database but a query engine on top of Hive and Shark. It is used for running interactive SQL queries on large volumes of data using data samples. BlinkDB is built on two key ideas: an adaptive optimization framework to build and maintain stratified samples, and a dynamic sample selection strategy to select an appropriately sized sample based on a query’s accuracy or response time requirements.
This paper offers an interesting introduction to applying statistical inference techniques to Big Data and makes clear that there is always a trade-off between accuracy and performance. In that regard, BlinkDB reports the accuracy of each query so the user can make decisions. Although it is not clear what the cost of maintaining stratified samples is, the paper provides a good seed for future work in the area.
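To make the two ideas more concrete, here is a minimal Python sketch. It is not BlinkDB’s implementation or API: the function names, the per-group cap used to build the stratified samples, and the normal-approximation error margin are all assumptions chosen for illustration. It keeps a few samples of different sizes, answers an average query on the smallest one, and moves to a larger one only when the estimated error is too high.

```python
import math
import random
from collections import defaultdict


def build_stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows per distinct value of `key` (hypothetical helper)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = {}
    for value, members in groups.items():
        kept = members if len(members) <= cap else rng.sample(members, cap)
        # Remember the true group size so the error estimate can use it.
        sample[value] = {"rows": kept, "population": len(members)}
    return sample


def approx_mean(stratum, column, z=1.96):
    """Estimate the mean of `column` for one group, with a roughly 95% margin."""
    values = [row[column] for row in stratum["rows"]]
    n, population = len(values), stratum["population"]
    mean = sum(values) / n
    if n == population:
        return mean, 0.0  # the whole group was kept, so the answer is exact
    if n < 2:
        return mean, math.inf  # a single sampled row gives no variance estimate
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    # Finite population correction: sampling without replacement from the group.
    margin = z * math.sqrt(variance / n * (1 - n / population))
    return mean, margin


def answer_within(samples_by_cap, group, column, max_margin):
    """Try samples from smallest to largest; stop at the first accurate enough."""
    result = None
    for cap in sorted(samples_by_cap):
        mean, margin = approx_mean(samples_by_cap[cap][group], column)
        result = (cap, mean, margin)
        if margin <= max_margin:
            break
    return result  # falls back to the largest sample if none meets the bound


# Synthetic data (made up for this example): one large group and one rare group.
rows = [{"city": "Madrid", "latency": 100 + (i % 7)} for i in range(10_000)]
rows += [{"city": "Soria", "latency": 250 + i} for i in range(5)]

samples = {cap: build_stratified_sample(rows, "city", cap) for cap in (50, 500, 5000)}
cap, mean, margin = answer_within(samples, "Madrid", "latency", max_margin=0.2)
print(f"used cap={cap}: mean ~ {mean:.2f} +/- {margin:.2f}")
```

Because the sample is stratified by city, the rare group (“Soria”) is kept whole and answered exactly, whereas a uniform sample of the same size could easily miss it. That is the motivation for stratification described in the paper, and the sketch also shows where the maintenance cost comes from: every cap requires its own stored sample.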
Agarwal, Sameer, et al. “BlinkDB: queries with bounded errors and bounded response times on very large data.” Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013.
The paper of the week is “Building LinkedIn’s real-time activity data pipeline”, which is available for download from sites.computer.org.
We recommend this paper for the following reasons:
- It describes the experience, problems, and improvements of a real use case with Apache Kafka.
- It shows in detail real sizing figures, the performance values obtained, and other interesting metrics from a Kafka-based system.
- It offers ideas for improvement, such as relying on the operating system’s cache.
Note that this paper does not explain how Apache Kafka works; instead, it describes in detail the experiences and improvements from a real deployment.
You can download the original PDF by clicking here.