Swarm Intelligence Metaheuristics, part 1: Ant Colony Optimization

In a previous post, we reviewed the taxonomy of metaheuristic algorithms for optimization in the context of feature selection for machine learning problems. We explained how feature selection can be tackled as a combinatorial optimization problem over a huge search space, and how metaheuristic algorithms (or simply metaheuristics) are able to find good, although not necessarily optimal, solutions in a reasonable amount of time by exploring that space intelligently. Recall that metaheuristics are especially well suited to cases where the function being optimized is non-differentiable or has no analytical expression at all (for instance, when the magnitude being optimized is the result of a complex randomized simulation run under a parameter set that constitutes a candidate solution). Analytical methods cannot help us in such cases, and metaheuristics may be the only way to go.
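
To make the idea of searching a feature-subset space with a black-box objective concrete, here is a minimal sketch. It is not Ant Colony Optimization, only a plain bit-flip local search, arguably the simplest relative of the metaheuristics discussed in this series. The names `hill_climb`, `toy_score` and `USEFUL` are hypothetical, and the toy objective merely stands in for an expensive, non-differentiable evaluation such as training a model or running a simulation.

```python
import random

def hill_climb(score, n_features, iterations=500, seed=0):
    """Bit-flip local search over feature subsets.

    `score` is treated as a black box: it is only ever called, never
    differentiated or inspected.
    """
    rng = random.Random(seed)
    # Start from a random subset (True = feature is selected).
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    for _ in range(iterations):
        # Propose a neighbour by toggling one randomly chosen feature.
        candidate = current[:]
        i = rng.randrange(n_features)
        candidate[i] = not candidate[i]
        candidate_score = score(candidate)
        # Keep the neighbour if it is at least as good.
        if candidate_score >= current_score:
            current, current_score = candidate, candidate_score
    return current, current_score

# Hypothetical objective: features 0, 2 and 5 are informative,
# every other selected feature costs one point.
USEFUL = {0, 2, 5}

def toy_score(mask):
    selected = {i for i, keep in enumerate(mask) if keep}
    return 2 * len(selected & USEFUL) - len(selected - USEFUL)

best_mask, best_score = hill_climb(toy_score, n_features=8)
print([i for i, keep in enumerate(best_mask) if keep], best_score)
```

On this toy objective every flip is independently good or bad, so plain local search is enough. Real objectives have interacting features and local optima, which is where population-based metaheuristics become useful.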


Correlation does not imply… sluggishness

When Ronald Fisher, the father of statistics, started to witness mounting evidence of an association between smoking and lung cancer, he was quick to fall back on the maxim he helped coin, “Correlation does not imply causation”, discounting the evidence as spurious and continuing his habit as a heavy smoker of cigarettes. While his command of mathematics was well above that of his contemporaries, this shows that everybody is prone to bias, and that we should try not to fall in love with our hypotheses too early in our decision-making process. While Fisher may have had a point, the fact that he was a smoker himself certainly clouded his thinking and led him not to consider fairly all possible explanations for the available evidence.


Cooking ML Models

Have you ever watched cooking shows? You have probably noticed that the chefs usually have all the ingredients already separated and chopped. A chef is more useful and creative when cooking than when spending time peeling and chopping potatoes, even though that step is still an important part of the recipe. Likewise, a data scientist is more useful and creative when building models than when spending time on data preprocessing. In this way, where a chef prepares exquisite delicacies, a data scientist prepares succulent models.


Statistical Comparison of Machine Learning Algorithms (Part 1)

This is the first of a two-part series on applying statistical tests to formally compare several Machine Learning (ML) algorithms, in order to determine whether one generally outperforms the rest. In this first part, we explain the fundamentals of statistical tests, while in the second part we examine how they are applied to ML algorithm performance data to compare the algorithms from a statistical point of view.


The definitive visual build tool for Apache Spark: Sparta 2.0

Stratio has created a new user interface that allows you to work without writing a single line of code, which means that you need neither programming skills nor experience with advanced technologies such as Spark, Scala, HDFS, or Elasticsearch. Developers, architects, BI engineers, data scientists, business users and IT administrators can create data analytics applications in minutes with a powerful Spark Visual Editor. Welcome to Sparta 2.0, a brand-new version of Sparta born with the forthcoming release of the Stratio Data Centric Platform.


Inside the black box: explaining the unexplainable

Data analysts are often confronted with a seemingly difficult decision: choosing between a simple model that is easy to understand but lacks predictive power, and a more complex one that attains higher accuracy but leaves the user with a funny feeling, wondering how it works and, perhaps more importantly, how it arrived at its result. As with most things in life, everything comes at a cost, and balancing competing goals requires dealing with trade-offs. This is not an easy choice: if the analyst cannot follow the reasoning behind the model, neither can she explain it to demanding business users. Therein lies the dilemma. Those experienced in the field of analytics have probably faced this dichotomy once or twice in their careers.
