Machine Learning Software Tools

Numerous software packages implement the machine learning models and computing algorithms described in the previous section. We cannot illustrate all of them in this chapter, so we select several representative packages and demonstrate how to use them for machine learning tasks with methods such as Least Squares Support Vector Machine with Radial Basis Function Kernel, Random Forest, Stochastic Gradient Boosting, XGBoost, and Extremely Randomized Trees.

H2O.ai is an open-source software company committed to providing scalable machine learning technologies. Besides a web interface, its H2O platform also provides interfaces to R, Python, Scala, Java, JSON, and CoffeeScript/JavaScript. H2O supports commonly used machine learning algorithms, such as random forest, gradient boosting, generalized linear model, deep learning, and more. We use the R interface here, which is very flexible for tuning hyperparameters and defining customized constraints (e.g., training time, optimization of hyperparameter search, maximum number of models to be trained, and stopping metric). The available models are listed in Table 10.1.

TABLE 10.1

Available Models of H2O

• Cox Proportional Hazards (CoxPH)
• Aggregator
• Deep Learning (Neural Networks)
• Generalized Low Rank Models (GLRM)
• Distributed Random Forest (DRF)
• Isolation Forest
• Generalized Linear Model (GLM)
• K-Means Clustering
• Gradient Boosting Machine (GBM)
• Principal Component Analysis (PCA)
• Naive Bayes Classifier
• Stacked Ensembles
• Word2vec
• Support Vector Machine (SVM)
• XGBoost
• MOJO Models

Here we give some example code for H2O-based machine learning methods. First, we load the H2O library in R and prepare the training and testing datasets.

One important step is to convert the training and testing datasets into H2O objects:

We then tune the parameters by random search, which is indicated by "RandomDiscrete" in the search criteria. H2O also supports grid search; to use it, we just need to specify the search strategy as "Cartesian". The evaluation metric we use is AUC; other metrics are available as well, for example, MSE, MAE, RMSE, and deviance.
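The difference between the two search strategies can be sketched without H2O itself. The following is a minimal, library-free Python illustration, not H2O's implementation: the hyperparameter names and values in `hyper_params` and the `max_models` budget are hypothetical and chosen only for the example. A Cartesian search enumerates every combination in the grid, whereas a random discrete search draws distinct combinations until the budget is reached.

```python
import itertools
import random

# Hypothetical hyperparameter grid (names and values are illustrative only).
hyper_params = {
    "ntrees": [50, 100, 200],
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.1],
}

def cartesian_search(grid):
    """Enumerate every combination in the grid (exhaustive grid search)."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[n] for n in names))]

def random_discrete_search(grid, max_models, seed=0):
    """Sample up to max_models distinct combinations uniformly at random."""
    candidates = cartesian_search(grid)
    rng = random.Random(seed)
    return rng.sample(candidates, min(max_models, len(candidates)))

all_models = cartesian_search(hyper_params)
sampled_models = random_discrete_search(hyper_params, max_models=5)
print(len(all_models), len(sampled_models))
```

With a fixed training budget, the random strategy trades exhaustive coverage for a bounded number of fitted models, which is why it is often preferred when the grid is large.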


The caret package (Classification and Regression Training) in R is a versatile library that can handle multiple tasks. At the time of writing, 238 statistical or machine learning models are available in this package, covering both classification and regression (e.g., AdaBoost Classification Trees and k-Nearest Neighbors).

The basic process of creating predictive models is:

  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation

Here is a code example of data splitting and preprocessing, using the example data set "Sonar" from the "mlbench" package.

Split the dataset into training and testing sets via the following code:
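The idea behind an outcome-stratified split can also be sketched independently of caret. The following minimal, library-free Python illustration is not caret's implementation: the function name `stratified_split`, the 75% training fraction, and the toy label vector are assumptions made for the example. Each class is shuffled and cut separately, so the class proportions are preserved in both sets.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)  # random order within the class
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical two-class outcome vector (60 "M" and 40 "R" labels).
labels = ["M"] * 60 + ["R"] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```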

Here is an example of repeated 10-fold cross-validation for parameter tuning:
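To make the resampling scheme concrete, repeated k-fold cross-validation can be sketched as follows. This is a minimal, library-free Python illustration of the scheme rather than caret's implementation; the function name `repeated_kfold_indices`, the sample size of 100, and the choice of 3 repeats are assumptions made for the example. Every repeat draws a fresh random partition into 10 folds, and each fold is held out once.

```python
import random

def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        # Deal the shuffled indices into n_folds nearly equal-sized folds.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx

splits = list(repeated_kfold_indices(n_samples=100))
print(len(splits))  # one entry per (repeat, fold) pair
```

A candidate parameter setting would then be scored on every held-out fold and the scores averaged across all folds and repeats.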

Having chosen the cross-validation method, we fit a simple boosted tree model as an example. The "method" option can be changed to many other methods, and "trControl" is set to the repeated 10-fold cross-validation described before. After the model is fitted to the training data set, it is used to make predictions for the testing data set.

Sometimes we also want to identify the variables that are most strongly related to, or most predictive of, the outcome. The following code evaluates the variable importance and produces a plot of the top 20 most important variables.

More details on the caret package can be found in the book Applied Predictive Modeling (Kuhn and Johnson 2013) and in a paper in the Journal of Statistical Software (Kuhn 2008).


The Tree-Based Pipeline Optimization Tool (TPOT) is an open-source AutoML software package in Python built on the scikit-learn library. The goal of TPOT is to automate the design and implementation of machine learning pipelines using a stochastic optimization process to make machine learning more friendly and accessible to non-experts (Olson et al. 2016).

TPOT optimizes a pipeline by starting from several simple and randomly chosen pipelines (the population). For every iteration of the optimization process (a generation), TPOT makes several copies of the current best-performing pipelines in the population and then applies random changes to them (e.g., adding or removing an operation or tuning a parameter setting of one operation). At the end of every generation, the worst-performing pipelines are removed from the population, and TPOT proceeds to the next generation. After a fixed number of generations, TPOT recommends the best-performing pipeline that it has created during the optimization process. TPOT is very user-friendly; even people with no data analysis experience can easily learn and use it. You only need to specify the number of generations and the population size.
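The generational loop described above can be sketched with a toy example. The code below is a minimal, library-free Python illustration and not TPOT's implementation: a single number stands in for a whole pipeline, and the names `evolve`, `score`, and `n_survivors` are hypothetical. The `generations` and `population_size` arguments mirror the two settings mentioned in the text.

```python
import random

def evolve(fitness, mutate, random_candidate,
           generations=5, population_size=20, n_survivors=5, seed=0):
    """Toy generational search: keep the best, copy and mutate them, repeat."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Rank the candidates and drop the worst-performing ones.
        population.sort(key=fitness, reverse=True)
        survivors = population[:n_survivors]
        # Refill the population with randomly changed copies of survivors.
        offspring = [mutate(rng.choice(survivors), rng)
                     for _ in range(population_size - n_survivors)]
        population = survivors + offspring
    return max(population, key=fitness)

# Hypothetical stand-in for a pipeline: one numeric setting whose
# "cross-validated score" peaks at 3.0.
def score(x):
    return -(x - 3.0) ** 2

best = evolve(
    fitness=score,
    mutate=lambda x, rng: x + rng.gauss(0.0, 0.5),
    random_candidate=lambda rng: rng.uniform(-10.0, 10.0),
)
print(best, score(best))
```

Because the survivors are carried over unchanged, the best score in the population never decreases from one generation to the next; more generations or a larger population buy a broader search at the cost of more model fits.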


Auto-sklearn is also an open-source AutoML toolkit, built upon the scikit-learn library. The main goal of this package is to automatically select algorithms and tune parameters. There are 4 data preprocessing methods, 14 feature preprocessing methods, and 15 classification algorithms included in this package (Feurer et al. 2015).

In our application example, the algorithms we used are logistic regression, SVM, random forest, gradient boosting, and XGBoost. We used 5-fold cross-validation and the same tuning time to compare the AUC across the machine learning algorithms (see the next section). Auto-sklearn is quite flexible in defining training time and tuning options, and the code is easy to understand, as the following example shows:

This example also shows how to use 5-fold cross-validation. Note that, for cross-validation to work properly, you have to reload the data and call the refit function before calculating AUC.
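The AUC used for these comparisons has a simple pairwise interpretation: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The following is a minimal, library-free Python sketch of that definition; the function name `auc_score` and the toy labels and scores are hypothetical, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 0.75
```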
