Benchmarking Machine learning prediction models

When surfing the internet, it is easy to find sites comparing the most popular machine learning toolkits. These sites give you a lot of information about the strengths and weaknesses of the libraries, how they work, and some examples showing how easy each tool is to use. If you are new to the field, they are very helpful for finding the right library to start studying your data. In fact, they are written by data scientists for data scientists.

However, as a software engineer you would rather know whether these tools are going to work well or just crash your servers. Based on this premise, the main objective of this article is to explore some machine learning libraries and see how they behave in a real-time, semi-production scenario.

A bit of background

First, a little background on how machine learning projects work. These projects are mainly divided into two phases:

  • Modeling: this first phase focuses on the data, and the objective is to find the model that best fits and explains the input.
    • Reading: obtaining data from all the available sources, such as databases, ETLs or logs.
    • Cleaning: although there is a large bulk of information, it is just raw data and would be hard to exploit as it is, so several techniques have to be applied to convert it into valuable and readable information.
    • Modeling: the model, or vector of equations, is obtained by applying different algorithms. This model can later be used to predict or classify input data.
    • Validation: after the model is created, it must be validated using techniques such as cross-validation to check that it behaves as expected. If not, the model must be calculated again.
    • Evaluation: a metric to evaluate the precision of the model is defined from a subset of the available data. This metric will also be used as a criterion for evaluating the model in the update phase.
  • Model execution: in this phase the model, which has been calculated beforehand, is used to predict or classify the input data.
    • Loading the model: just reading the model from a DB or disk.
    • Reading the input data: the input data for the model must be read and pre-processed in order to be used by our model.
    • Execution: apply the model to our input data. Depending on the nature of the model, the result can be a prediction or a classification.
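
As a rough illustration of the two phases above, the following Python sketch trains a deliberately trivial model (a lookup table of the most frequent label per input value), checks it on a held-out split, saves it to disk, and later loads it to classify new input. Everything here is hypothetical: the toy data, the `train`, `predict` and `accuracy` helpers, and the file name only stand in for what a real toolkit would provide.

```python
import pickle
import random
import tempfile
from collections import Counter, defaultdict
from pathlib import Path


def train(rows):
    """Modeling: map each input value to its most frequent label."""
    counts = defaultdict(Counter)
    for value, label in rows:
        counts[value][label] += 1
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fallback label for inputs never seen during training.
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return {"table": table, "default": default}


def predict(model, value):
    """Execution: apply the model to one input value."""
    return model["table"].get(value, model["default"])


def accuracy(model, rows):
    """Evaluation: fraction of rows classified correctly."""
    hits = sum(predict(model, value) == label for value, label in rows)
    return hits / len(rows)


# --- Phase 1: modeling ---
# Reading and cleaning: hypothetical (platform, browser) pairs.
raw = (
    [("Win32", "Chrome")] * 60
    + [("MacIntel", "Safari")] * 30
    + [("Win32", "Firefox")] * 10
)
random.Random(0).shuffle(raw)

# Validation: hold out 20% of the data to check the model.
split = int(len(raw) * 0.8)
train_rows, test_rows = raw[:split], raw[split:]
model = train(train_rows)
score = accuracy(model, test_rows)

# Persist the model so the execution phase can load it later.
model_path = Path(tempfile.gettempdir()) / "toy_browser_model.pkl"
model_path.write_bytes(pickle.dumps(model))

# --- Phase 2: model execution ---
loaded = pickle.loads(model_path.read_bytes())
print(predict(loaded, "MacIntel"), f"hold-out accuracy: {score:.2f}")
```

The split between the two phases is the point of the sketch: the expensive part runs once and offline, while the execution phase only has to load a file and do a lookup.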

ML Toolkits

Although there are plenty of machine learning toolkits, this article focuses only on the three I consider most important: R, Scikit-learn and Spark MLLib.

Comparison Chart

In order to have a global view of the three ML libraries, there is a table below comparing them:

|                                | R                                     | Scikit-learn                          | MLLib                                 |
|--------------------------------|---------------------------------------|---------------------------------------|---------------------------------------|
| Programming language           | R                                     | Python                                | Scala/Java/Python                     |
| Range of algorithms            | Extended                              | Good                                  | Limited to distributed only           |
| Speed (small-medium data size) | Med-High                              | Med                                   | Med                                   |
| Scalability for Big Data       | Very limited (only scales vertically) | Very limited (only scales vertically) | Excellent                             |
| Data source integration        | Good                                  | Very good                             | Very good                             |
| Visualization tools            | Very good                             | Very good                             | Limited and depends on other partners |
| Learning curve                 | High                                  | Small                                 | Average                               |
| IDEs                           | RStudio / Jupyter                     | Eclipse / Jidea / Jupyter             | Eclipse / Jidea / Jupyter             |

Model Execution Options

The main objective of this article is to benchmark the execution of an ML model in a semi-real-time environment. Therefore, a simple scenario has been selected in which customers send requests containing some parameters of their browsers to a REST service, which responds with a prediction of their browser using a trained ML model.

R scenario

Although the R language has an HTTP server library [4], it is very limited. Therefore, the approach is quite different from the other cases.

  • The REST server is based on the J2EE platform implemented using the Spring framework.
  • The server receives the request and executes an R script (creating a new session every time) using an Rserve instance.
  • In order to increase the performance, there are some optimizations:
    • The server calls a cluster of two Rserve instances using round-robin scheduling.
    • The R script caches the model using an in-memory DB (Redis).
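
The round-robin scheduling mentioned above can be sketched in a few lines. This is a Python illustration of the idea only, not the Spring code used in the benchmark; the `RoundRobinBalancer` class and the two host names are hypothetical (6311 is Rserve's default port).

```python
import threading
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = cycle(backends)
        # The lock keeps the rotation consistent when several
        # request threads ask for a backend at the same time.
        self._lock = threading.Lock()

    def next_backend(self):
        with self._lock:
            return next(self._cycle)


# Hypothetical addresses for the two Rserve instances.
balancer = RoundRobinBalancer(["rserve-1:6311", "rserve-2:6311"])
picks = [balancer.next_backend() for _ in range(4)]
print(picks)
```

Each incoming request simply takes the next instance in the rotation, so the load is split evenly as long as requests cost roughly the same.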

Scikit-learn scenario

In this scenario a Python server has been implemented which will give the best performance for the Scikit-learn library.

  • The REST server is implemented in Python using the Flask framework.
  • The server receives the request and executes a Scikit-learn model.
  • In order to optimize the performance, the model is cached in memory.
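
Caching the model in memory means it is read from disk only once, not on every request. Here is a minimal sketch of that idea using only the Python standard library. It is not the Flask server used in the benchmark: the `get_model` and `handle_request` functions are hypothetical, and a plain pickled dictionary stands in for a trained Scikit-learn model.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path

load_count = 0  # counts real disk reads, for illustration only


@lru_cache(maxsize=1)
def get_model(path):
    """Load the model from disk once; later calls reuse the cached object."""
    global load_count
    load_count += 1
    return pickle.loads(Path(path).read_bytes())


def handle_request(path, features):
    """Stand-in for a request handler that needs the model."""
    model = get_model(path)  # cache hit after the first request
    return model.get(features, "unknown")


# Stand-in "model": a plain dict pickled to a temporary file.
model_file = str(Path(tempfile.gettempdir()) / "cached_model_demo.pkl")
Path(model_file).write_bytes(pickle.dumps({"Win32": "Chrome"}))

answers = [handle_request(model_file, "Win32") for _ in range(3)]
print(answers, "disk loads:", load_count)
```

Three requests are served but the file is read a single time, which is the effect the in-memory cache is meant to achieve.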

MLLib scenario

In the last scenario, a pure Java solution has been developed giving the best performance for MLLib.

  • The REST server is implemented in Java using the Spring framework.
  • The server receives the request and executes an MLLib model.
  • In order to optimize the performance, the model is cached in memory.

Benchmark Specifications

These are specs for the model training:

  • The chosen model is a Random Forest classifier with 25 trees in the forest.
  • The data source has 100,000 entries with 3 input variables and 1 output.
  • In order to obtain numeric features, the algorithm used is FeatureHasher.
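
To show what feature hashing does, here is a small pure-Python sketch of the hashing trick. It is not Scikit-learn's FeatureHasher, which is based on a signed 32-bit MurmurHash3; MD5 is swapped in here only to stay within the standard library. The `hash_features` function, the vector size and the three input names are assumptions made for illustration.

```python
import hashlib


def hash_features(features, n_features=16):
    """Map a dict of string features to a fixed-length numeric vector."""
    vector = [0.0] * n_features
    for name, value in features.items():
        token = f"{name}={value}".encode("utf-8")
        digest = hashlib.md5(token).digest()
        number = int.from_bytes(digest[:8], "big")
        # The hash decides which slot of the vector this feature lands in.
        index = number % n_features
        # Use one hash bit as the sign so collisions tend to cancel out.
        sign = 1.0 if (number >> 63) & 1 == 0 else -1.0
        vector[index] += sign
    return vector


# Hypothetical browser attributes sent by a client.
request = {"platform": "Win32", "language": "en-US", "vendor": "Google Inc."}
vec = hash_features(request)
print(vec)
```

The useful property is that any set of string attributes becomes a vector of the same length without keeping a vocabulary, which is why it suits categorical request data.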

For the load tests:

  • The server machine is a large AWS instance with 2 cores, 8 GB of RAM and SSD disks.
  • The tool used to measure the performance is JMeter.
  • Three different test scenarios have been selected:
    • Low traffic: 200 requests using 4 threads.
    • Average traffic: 500 requests using 10 threads.
    • High traffic: 1000 requests using 10 threads.
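
For readers without JMeter at hand, the shape of such a test can be approximated with a short Python script. This is only a sketch of the measurement idea, not the setup used for the benchmark: `fake_request` is a hypothetical stand-in that sleeps instead of calling a real REST service, and `run_load_test` is an illustrative helper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request():
    """Stand-in for one HTTP call; returns the elapsed time in milliseconds."""
    start = time.perf_counter()
    time.sleep(0.005)  # pretend the service needs about 5 ms
    return (time.perf_counter() - start) * 1000.0


def run_load_test(n_requests, n_threads, request_fn=fake_request):
    """Fire n_requests calls from n_threads workers and summarize latencies."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        latencies = list(pool.map(lambda _: request_fn(), range(n_requests)))
    elapsed = time.perf_counter() - started
    return {
        "requests": len(latencies),
        "avg_ms": sum(latencies) / len(latencies),
        "min_ms": min(latencies),
        "max_ms": max(latencies),
        "req_per_s": len(latencies) / elapsed,
    }


# Same shape as the low-traffic scenario: 200 requests from 4 threads.
summary = run_load_test(n_requests=200, n_threads=4)
print(summary)
```

Replacing `fake_request` with a function that performs a real HTTP call would give the same kind of columns reported in the results: average, minimum and maximum latency, plus requests per second.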

Test results

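| Low traffic  | # Requests | Avg (ms) | Min (ms) | Max (ms) | Req/s |
|--------------|------------|----------|----------|----------|-------|
| R            | 200        | 938      | 516      | 1735     | 4.18  |
| Scikit-learn | 200        | 164      | 141      | 994      | 23.11 |
| MLLib        | 200        | 65       | 51       | 130      | 45.58 |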


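| Average traffic | # Requests | Avg (ms) | Min (ms) | Max (ms) | Req/s  |
|-----------------|------------|----------|----------|----------|--------|
| R               | 500        | 2411     | 818      | 2807     | 4.1    |
| Scikit-learn    | 500        | 366      | 141      | 445      | 25.97  |
| MLLib           | 500        | 67       | 51       | 219      | 115.15 |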


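| High traffic | # Requests | Avg (ms) | Min (ms) | Max (ms) | Req/s  |
|--------------|------------|----------|----------|----------|--------|
| R            | 1000       | 4833     | 2223     | 5244     | 4.11   |
| Scikit-learn | 1000       | 748      | 142      | 821      | 25.71  |
| MLLib        | 1000       | 57       | 51       | 119      | 246.91 |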


After reviewing the benchmark results, the clear conclusion is that the fastest option is MLLib. In all the scenarios it took less than 70 ms on average to process a request. The next fastest solution is Scikit-learn. Although it was able to respond in less than a second on average, a real-time system could not afford that processing time, even if it stays under 200 ms with low traffic. Finally, R showed really bad results. Its average response time ranged from just under 1 second with low traffic to almost 5 seconds with high traffic, so under no circumstances can it be used for predicting in a real-time system. Not only is it really slow, it also pushed the system to the limit.
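
To put the gap in numbers, the average latencies from the result tables can be divided by the MLLib figure for the same scenario. The short Python snippet below does only that arithmetic; `slowdown_vs_mllib` is an illustrative helper, and the values are copied from the tables above.

```python
# Average latency in ms per scenario, copied from the result tables.
avg_ms = {
    "low": {"R": 938, "Scikit-learn": 164, "MLLib": 65},
    "average": {"R": 2411, "Scikit-learn": 366, "MLLib": 67},
    "high": {"R": 4833, "Scikit-learn": 748, "MLLib": 57},
}


def slowdown_vs_mllib(scenario):
    """How many times slower each toolkit is than MLLib in one scenario."""
    base = avg_ms[scenario]["MLLib"]
    return {name: round(ms / base, 1) for name, ms in avg_ms[scenario].items()}


for scenario in avg_ms:
    print(scenario, slowdown_vs_mllib(scenario))
```

With these figures, R goes from roughly 14 times slower than MLLib under low traffic to roughly 85 times slower under high traffic, while Scikit-learn goes from about 2.5 times to about 13 times slower.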


R

  • Pros:
    • Wide range of ML models.
    • Widely used by the scientific community.
  • Cons:
    • Very low performance.
    • Model training is neither scalable nor distributable.
    • Higher maintenance costs, because it requires a cluster of Rserve servers.


Scikit-learn

  • Pros:
    • Wide range of ML models.
    • Widely used by the scientific community.
  • Cons:
    • Low performance.
    • Model training is neither scalable nor distributable.


MLLib

  • Pros:
    • Good performance
    • Scalable and distributable model training
  • Cons:
    • Small range of ML models
