H2O4GPU is a collection of GPU solvers by H2O.ai with APIs in Python and R. The Python API builds upon the easy-to-use scikit-learn API. The h2o4gpu R package is a wrapper around the h2o4gpu Python package.
The R package makes use of RStudio's reticulate package for facilitating access to Python libraries through R. Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability.
H2O4GPU is a new project under active development and we are looking for contributors! If you find a bug, please check that we have not already fixed the issue in the bleeding edge version and then check that we do not already have an issue opened for this topic. If not, then please file a new issue with a reproducible example.
There are a few system requirements: Linux with glibc 2.17+, Python >=3.6, R >=3.1, CUDA 8 or 9, and a machine with Nvidia GPUs. The code should still run on a machine that has only CPUs, but it will fall back to the scikit-learn CPU-based versions of the algorithms.
The h2o4gpu Python module is a prerequisite for the R package. So first, follow the instructions here to install the h2o4gpu Python package (either at the system level or in a Python virtual environment). The easiest thing to do is to pip install the stable release whl file. To ensure compatibility, the Python package version number should match the R package version number.
The recommended way of installing the R package is from CRAN using install.packages("h2o4gpu"). To install the development version of the h2o4gpu R package, you can install directly from GitHub as follows:
library(devtools)
devtools::install_github("h2oai/h2o4gpu", subdir = "src/interface_r")
Using a Python virtual environment is a good solution if you don't want to upgrade your main Python installation to 3.6. If you installed the h2o4gpu Python module into a virtual environment, you will have to add a line of code to tell R which Python environment you want to use:
library(reticulate)
use_virtualenv("/home/username/venv/h2o4gpu")  # set this to the path of your venv
If you have installed the h2o4gpu Python module using Anaconda, then you can use the use_condaenv() function instead. More information about Python environment configuration is available in the reticulate user guide.
Here's a quick demo of how to train and evaluate a GPU-based Random Forest classifier model. We will use the classic Iris dataset, which is a three-class classification problem, and evaluate the performance of the model using classification error.
library(h2o4gpu)
library(reticulate)  # only needed if using a virtual Python environment
use_virtualenv("/home/username/venv/h2o4gpu")  # set this to the path of your venv

# Prepare data
x <- iris[1:4]
y <- as.integer(iris$Species)  # all columns, including the response, must be numeric

# Initialize and train the classifier
model <- h2o4gpu.random_forest_classifier() %>% fit(x, y)

# Make predictions
pred <- model %>% predict(x)

# Compute classification error using the Metrics package (note this is training error)
library(Metrics)
ce(actual = y, predicted = pred)
H2O4GPU contains a collection of popular algorithms for supervised learning: Random Forest, Gradient Boosting Machine (GBM) and Generalized Linear Models (GLMs) with Elastic Net regularization. There are methods for regression and classification for each of these algorithms. Both Random Forest and GBM support multiclass classification. The GLM, however, currently supports only binomial classification (a ticket for multinomial support is open here).
The tree-based models (Random Forest and GBM) are built on top of the very powerful XGBoost library, and the Elastic Net GLM has been built upon the POGS solver. Proximal Graph Solver (POGS) is a solver for convex optimization problems in graph form that uses the Alternating Direction Method of Multipliers (ADMM). We have found that this method is not as fast as we'd like it to be, so we are working on implementing an entirely new GLM from scratch (follow progress here).
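For orientation, the graph-form problem that POGS targets, and one common way to write an Elastic Net regression in that form, can be sketched as below. The symbols (A for the feature matrix, b for the response, x for the coefficients, y for the fitted values, and the penalty weights lambda and alpha) and the exact scaling of the penalty are assumptions chosen for illustration, and may not match the parametrization that h2o4gpu uses internally.

```latex
\begin{aligned}
\text{graph form:}\quad & \min_{x,\,y}\; f(y) + g(x) \quad \text{subject to } y = Ax \\
\text{Elastic Net:}\quad & f(y) = \tfrac{1}{2}\lVert y - b \rVert_2^2, \qquad
g(x) = \lambda\left(\alpha \lVert x \rVert_1 + \tfrac{1-\alpha}{2}\lVert x \rVert_2^2\right)
\end{aligned}
```

In this reading, f measures the fit to the response and g carries the regularization, while the constraint y = Ax ties the two together.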
The h2o4gpu R package does not include a suite of internal model metrics functions, so we encourage users to use a third-party model metrics package of their choice. For all the examples below, we will use the Metrics R package. This package has a large number of model metrics functions, all with a very simple, unified API.
In this example, we will train and test three different models on a subset of the HIGGS dataset. The goal in this dataset is to distinguish between signal "1" and background "0", so this is a binary classification problem. The features are all numeric.
H2O4GPU requires all feature and response columns to be numeric, so in this case, we don't have to do any pre-processing of the data. If your response column is a factor, then you can simply convert the levels to integer values using as.integer(). If you have categorical/factor columns among your features, you must apply an encoding method to convert the columns into numeric data. Some options are label encoding (simply convert the levels to integers) or one-hot encoding (binary indicator columns, one for each categorical level). For simplicity, in this tutorial, we will always use label encoding. However, you can read more about different types of encodings here.
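The two encodings described above can be sketched in base R as follows. The toy data frame and its column names are hypothetical and are not part of the tutorial datasets; model.matrix() is only one of several ways to build indicator columns.

```r
# Toy data frame (hypothetical; not part of the tutorial datasets)
df <- data.frame(
  size  = c(1.2, 3.4, 2.2, 0.7),
  color = factor(c("red", "green", "blue", "green"))
)

# Label encoding: replace each level with its integer code
# (levels are sorted alphabetically: blue = 1, green = 2, red = 3)
df_label <- df
df_label$color <- as.integer(df_label$color)

# One-hot encoding: one binary indicator column per level
# ("- 1" drops the intercept so that every level gets its own column)
onehot <- model.matrix(~ color - 1, data = df)
df_onehot <- cbind(df["size"], as.data.frame(onehot))

# Both results contain only numeric columns
stopifnot(all(sapply(df_label, is.numeric)),
          all(sapply(df_onehot, is.numeric)))
```

Label encoding keeps the number of columns unchanged but imposes an arbitrary ordering on the levels, whereas one-hot encoding avoids that ordering at the cost of extra columns.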
# Load a sample dataset for binary classification
# Source: https://archive.ics.uci.edu/ml/datasets/HIGGS
train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Create train & test sets (column 1 is the response)
x_train <- train[, -1]
y_train <- train[, 1]
x_test <- test[, -1]
y_test <- test[, 1]
Below we see that the h2o4gpu modeling functions follow a two-phased functional approach. The two-phased approach to modeling (first initialize the model, then train) is more common in Python, and we borrow that paradigm here. We blend this with the functional pipe syntax in R.
First you define the model with its hyperparameters, for example, h2o4gpu.gradient_boosting_classifier(n_estimators = 500, subsample = 0.8). Then you pipe the initialized model object to the fit(x, y) function to train the model, and save the resulting object.
# Train three different binary classification models
model_gbc <- h2o4gpu.gradient_boosting_classifier() %>% fit(x_train, y_train)
model_rfc <- h2o4gpu.random_forest_classifier() %>% fit(x_train, y_train)
model_enc <- h2o4gpu.elastic_net_classifier() %>% fit(x_train, y_train)
We pipe our trained models to the familiar predict() method. In binary classification, we are often more interested in the numeric predicted values than in the predicted class labels. We follow the same design as the predict() function in the popular caret package, which allows the user to specify which type of predictions they want to return using the type argument. This defaults to "raw", which in classification yields predicted class labels. When we set it to "prob", it returns the (uncalibrated) class probabilities. This is not mentioned often in modeling software documentation, but you should note that despite using the term "probabilities", these predicted values do not represent actual probabilities unless some method like Platt scaling is used for calibration. This is true for all machine learning packages, including caret, h2o, and h2o4gpu (though we do offer the option to perform Platt scaling inside the h2o R package).
# Generate predictions (type "prob" gives predicted values instead of predicted label)
pred_gbc <- model_gbc %>% predict(x_test, type = "prob")
pred_rfc <- model_rfc %>% predict(x_test, type = "prob")
pred_enc <- model_enc %>% predict(x_test, type = "prob")
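To make the calibration point concrete, here is a minimal base R sketch of a simplified form of Platt scaling: a one-variable logistic regression fitted on held-out scores. The data are simulated and every variable name is hypothetical. This is not the calibration routine of h2o or of any other package mentioned here, only an illustration of the idea.

```r
set.seed(1)

# Simulated held-out calibration set (hypothetical stand-in for real model output)
n <- 1000
y_calib <- rbinom(n, size = 1, prob = 0.5)
# Uncalibrated scores in (0, 1): informative about y, but not true probabilities
score_calib <- plogis(3 * (y_calib - 0.5) + rnorm(n, sd = 2))

# Simplified Platt scaling: logistic regression of the labels on the scores
platt <- glm(y ~ score, family = binomial(),
             data = data.frame(y = y_calib, score = score_calib))

# Map new raw scores to calibrated probabilities
new_scores <- c(0.1, 0.5, 0.9)
calibrated <- predict(platt, newdata = data.frame(score = new_scores),
                      type = "response")
calibrated
```

With real models, the scores would be the positive-class column of the predict() output on a calibration set that was not used for training.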
Let's take a look at what the output of the predict() function looks like in binary classification. It will be a two-column matrix with the column names set to the names of the classes.
To compute the AUC of a binary classification model, we use the predicted values in the second column (the "positive" class) and pass them to the auc() function from the Metrics package.
# Compare test set performance using AUC
library(Metrics)  # provides auc()
auc(actual = y_test, predicted = pred_gbc[, 2])
auc(actual = y_test, predicted = pred_rfc[, 2])
auc(actual = y_test, predicted = pred_enc[, 2])
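For readers who want to see what the AUC measures, the rank-based definition can be sketched in base R. The helper name auc_by_rank and the toy values are hypothetical; this is an illustration of the metric, not the Metrics package's implementation.

```r
# AUC from ranks: the probability that a randomly chosen positive
# receives a higher score than a randomly chosen negative (ties count 1/2)
auc_by_rank <- function(actual, predicted) {
  r <- rank(predicted)                    # average ranks handle ties
  n_pos <- as.numeric(sum(actual == 1))
  n_neg <- as.numeric(sum(actual == 0))
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hand-checkable example (hypothetical values)
actual    <- c(0, 0, 1, 1)
predicted <- c(0.1, 0.4, 0.35, 0.8)
auc_by_rank(actual, predicted)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of positives above negatives.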
Now that we are familiar with binary classification, there is not much more to say about multiclass classification. The predict output will have the same format as in binary classification, except that if you use type = "prob", the number of columns will match the number of classes. Often in multiclass classification, you may be interested in the predicted class label and misclassification error, which we've demonstrated already in the Quickstart section.
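As a small illustration of how a matrix of class probabilities relates to class labels and misclassification error, consider the base R sketch below. The probability matrix, its column names, and the actual labels are hypothetical values made up for this example, not output from h2o4gpu.

```r
# Hypothetical three-class probability matrix, with class names as column names
prob <- matrix(c(0.70, 0.20, 0.10,
                 0.10, 0.30, 0.60,
                 0.20, 0.50, 0.30,
                 0.40, 0.35, 0.25),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("1", "2", "3")))
actual <- c(1, 3, 2, 2)

# Predicted label = column with the largest value in each row
pred_label <- as.integer(colnames(prob)[max.col(prob, ties.method = "first")])

# Misclassification error = share of rows where the label is wrong
mean(pred_label != actual)  # one of four rows is wrong: 0.25
```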
In this next exercise, we will compare a GBM and GLM regression model. Until this issue is resolved, we don't recommend that you use the Random Forest regressor, as there are some bugs that are severely affecting model performance.
We will predict the age of abalone from physical measurements, using the Abalone dataset.
# Load a sample dataset for regression
# Source: https://archive.ics.uci.edu/ml/datasets/Abalone
df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE)
str(df)
There is one categorical/factor column in this dataset, so we will first convert those values to integers (label encoding). Recall that label encoding is just one way of encoding the categorical column and that there may be other ways that produce better results in terms of model performance.
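# Label encode the one categorical column
# (as.factor() also covers R >= 4.0, where read.csv() returns character strings by default)
df[, 1] <- as.integer(as.factor(df[, 1]))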
In this case, we started with a single data frame, so we should break the data into train and test splits at random. We can do that easily in R by sampling 80% of the row indices and subsetting the data frame by row.
# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(df), 0.8*nrow(df))

# Create train & test sets (column 9 is the response)
x_train <- df[train_idx, -9]
y_train <- df[train_idx, 9]
x_test <- df[-train_idx, -9]
y_test <- df[-train_idx, 9]
# Train two different regression models
model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train)
model_enr <- h2o4gpu.elastic_net_regressor() %>% fit(x_train, y_train)

# Generate predictions
pred_gbr <- model_gbr %>% predict(x_test)
pred_enr <- model_enr %>% predict(x_test)
In regression, the predict() function always returns a vector of predictions (not a data frame).
In regression problems, Mean Squared Error (MSE) is a common metric for model evaluation. We will use test set MSE to evaluate and compare our two models.
# Compare test set performance using MSE
mse(actual = y_test, predicted = pred_gbr)
mse(actual = y_test, predicted = pred_enr)
In this case, which is not usual, the GBM drastically outperforms the GLM.
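For reference, the MSE used in the comparison above is simply the mean of the squared residuals. The base R sketch below computes it by hand on made-up numbers (the vectors are hypothetical, not model output), along with the RMSE, which is sometimes easier to interpret.

```r
# Hypothetical actual and predicted values
actual    <- c(10, 12, 9, 15)
predicted <- c(11, 12, 7, 14)

# MSE: mean of the squared residuals
mse_value <- mean((actual - predicted)^2)

# RMSE: square root of the MSE, expressed in the units of the response
rmse_value <- sqrt(mse_value)

mse_value  # (1 + 0 + 4 + 1) / 4 = 1.5
```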
The unsupervised learning algorithms in h2o4gpu include K-Means, Principal Component Analysis (PCA), and Truncated Singular Value Decomposition (SVD).
First we will train a K-Means model. Let's create a train and test set from the iris dataset.
# Prepare data
iris$Species <- as.integer(iris$Species)  # convert to numeric data

# Randomly sample 80% of the rows for the training set
set.seed(1)
train_idx <- sample(1:nrow(iris), 0.8*nrow(iris))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]
Train a K-Means model with three clusters.
model_km <- h2o4gpu.kmeans(n_clusters = 3L) %>% fit(train)
Once you have trained a K-Means model, applying the transform() function to a dataset transforms your points into distances from each centroid. So your n x p matrix becomes an n x k matrix (where n is the number of observations, p the number of features and k the number of clusters).
test_dist <- model_km %>% transform(test)
head(test_dist)
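The change of shape from n x p to n x k can be reproduced on the CPU with a short base R sketch. It uses stats::kmeans as a stand-in rather than h2o4gpu, and it assumes plain Euclidean distance; whether h2o4gpu's transform() uses exactly this distance is an assumption here. All variable names are hypothetical.

```r
# CPU stand-in using stats::kmeans (not h2o4gpu) on the four numeric iris columns
x <- as.matrix(iris[, 1:4])
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 10)

# For each centroid, compute the Euclidean distance from every row of x
dist_to_centroids <- sapply(seq_len(nrow(km$centers)), function(j) {
  sqrt(rowSums(sweep(x, 2, km$centers[j, ])^2))
})

dim(x)                  # n x p: 150 x 4
dim(dist_to_centroids)  # n x k: 150 x 3

# The smallest distance in each row identifies the nearest centroid
nearest <- max.col(-dist_to_centroids, ties.method = "first")
```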
Let's use the HIGGS train and test datasets again for demonstration.
# Load a sample dataset for binary classification
# Source: https://archive.ics.uci.edu/ml/datasets/HIGGS
train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
Train a PCA model with 4 components and apply the transformation to a dataset. Once you have created a projection model from a dataset, you can apply that transformation to a new dataset (such as a test set) using the transform() function.
model_pca <- h2o4gpu.pca(n_components = 4) %>% fit(train)
test_transformed <- model_pca %>% transform(test)
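The idea of fitting a projection on one dataset and applying it to another can also be sketched on the CPU. The example below uses stats::prcomp as a stand-in (not h2o4gpu) on the numeric iris columns; the split, the choice of two components, and all variable names are assumptions for illustration.

```r
# CPU stand-in using stats::prcomp (not h2o4gpu) on the numeric iris columns
x <- as.matrix(iris[, 1:4])
set.seed(1)
train_idx <- sample(seq_len(nrow(x)), floor(0.8 * nrow(x)))
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]

# Fit the projection on the training rows only
pca <- prcomp(x_train, center = TRUE, scale. = FALSE)

# Apply the training centering and rotation to the held-out rows,
# keeping the first two components
test_scores <- predict(pca, newdata = x_test)[, 1:2]
dim(test_scores)  # 30 x 2
```

The held-out rows are centered with the training means, not their own, which is what makes the projection a property of the fitted model.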
Train a truncated SVD model with 4 components and apply the transformation on a test set.
model_tsvd <- h2o4gpu.truncated_svd(n_components = 4) %>% fit(train)
test_transformed <- model_tsvd %>% transform(test)
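A comparable CPU sketch for truncated SVD uses base::svd as a stand-in (not h2o4gpu). In its usual formulation, truncated SVD differs from PCA in that the data are not centered before the decomposition; the sketch assumes that formulation. The split, the value of k, and all variable names are hypothetical.

```r
# CPU stand-in using base::svd (not h2o4gpu) on the numeric iris columns
x <- as.matrix(iris[, 1:4])
set.seed(1)
train_idx <- sample(seq_len(nrow(x)), floor(0.8 * nrow(x)))
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]

k <- 2
# Decompose the (uncentered) training matrix: x_train = U D V'
s <- svd(x_train)
v_k <- s$v[, seq_len(k), drop = FALSE]  # top-k right singular vectors

# Project rows onto the top-k right singular vectors
train_reduced <- x_train %*% v_k
test_reduced  <- x_test %*% v_k
dim(test_reduced)  # 30 x 2
```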