In [1]:
from __future__ import print_function
import sklearn
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection
import pandas as pd
import numpy as np
import lime
import lime.lime_tabular
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

np.random.seed(1)
In [2]:
# Start an H2O virtual cluster that uses up to 500 MB of RAM and 6 cores
h2o.init(max_mem_size="500M", nthreads=6)

# Clean up the cluster just in case
h2o.remove_all()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O cluster uptime: 2 mins 30 secs
H2O cluster version: 3.10.4.8
H2O cluster version age: 16 days
H2O cluster name: H2O_from_python_marcotcr_ph8pt1
H2O cluster total nodes: 1
H2O cluster free memory: 5.314 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy: None
H2O internal security: False
Python version: 2.7.13 final

Wrapper class

We need a wrapper class that makes an H2O distributed random forest behave like a scikit-learn random forest. We instantiate the class with an H2O distributed random forest object and the feature column names. The predict_proba method takes a numpy array as input and returns an array of predicted probabilities for each class.

In [3]:
class h2o_predict_proba_wrapper:
    # model is the h2o distributed random forest object; column_names holds
    # the labels of the X values
    def __init__(self, model, column_names):
        self.model = model
        self.column_names = column_names

    def predict_proba(self, this_array):
        # If we have just 1 row of data we need to reshape it
        shape_tuple = np.shape(this_array)
        if len(shape_tuple) == 1:
            this_array = this_array.reshape(1, -1)

        # Convert the numpy array that Lime sends to a pandas dataframe and
        # then convert the pandas dataframe to an h2o frame
        pandas_df = pd.DataFrame(data=this_array, columns=self.column_names)
        h2o_df = h2o.H2OFrame(pandas_df)

        # Predict with the h2o drf
        predictions = self.model.predict(h2o_df).as_data_frame()
        # The first column is the predicted class label; the remaining
        # columns are the per-class probabilities
        return predictions.iloc[:, 1:].values

Continuous features

Loading data, training a model

For this part, we'll use the Iris dataset, and we'll replace the scikit-learn random forest with an H2O distributed random forest.

In [4]:
iris = sklearn.datasets.load_iris()

# Get the text names for the features and the class labels
feature_names = iris.feature_names
class_labels = 'species'
In [5]:
# Generate a train test split and convert to pandas and h2o frames

train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(iris.data, iris.target, train_size=0.80)

train_h2o_df = h2o.H2OFrame(train)
train_h2o_df.set_names(iris.feature_names)
train_h2o_df['species'] = h2o.H2OFrame(iris.target_names[labels_train])
train_h2o_df['species'] = train_h2o_df['species'].asfactor()

test_h2o_df = h2o.H2OFrame(test)
test_h2o_df.set_names(iris.feature_names)
test_h2o_df['species'] = h2o.H2OFrame(iris.target_names[labels_test])
test_h2o_df['species'] = test_h2o_df['species'].asfactor()
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
In [6]:
# rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
# rf.fit(train, labels_train)

iris_drf = H2ORandomForestEstimator(
        model_id="iris_drf",
        ntrees=500,
        stopping_rounds=2,
        score_each_iteration=True,
        seed=1000000,
        balance_classes=False,
        histogram_type="AUTO")

iris_drf.train(x=feature_names,
               y='species',
               training_frame=train_h2o_df)
drf Model Build progress: |███████████████████████████████████████████████| 100%
In [7]:
# sklearn.metrics.accuracy_score(labels_test, rf.predict(test))

iris_drf.model_performance(test_h2o_df)
ModelMetricsMultinomial: drf
** Reported on test data. **

MSE: 0.0270187026781
RMSE: 0.164373667837
LogLoss: 0.0874284925626
Mean Per-Class Error: 0.025641025641
Confusion Matrix: vertical: actual; across: predicted

            setosa  versicolor  virginica  Error      Rate
setosa        11.0         0.0        0.0  0.0        0 / 11
versicolor     0.0        12.0        1.0  0.0769231  1 / 13
virginica      0.0         0.0        6.0  0.0        0 / 6
Total         11.0        12.0        7.0  0.0333333  1 / 30
Top-3 Hit Ratios: 
k hit_ratio
1 0.9666666
2 1.0
3 1.0

Convert h2o to numpy array

The explainer requires numpy arrays as input, while h2o requires the train and test data to be in h2o frames. In this case we could simply reuse the train and test numpy arrays, but for illustrative purposes, here is how to convert an h2o frame to a pandas dataframe and a pandas dataframe to a numpy array.

In [8]:
train_pandas_df = train_h2o_df[feature_names].as_data_frame()
train_numpy_array = train_pandas_df.values

test_pandas_df = test_h2o_df[feature_names].as_data_frame()
test_numpy_array = test_pandas_df.values

Create the explainer

As opposed to lime_text.LimeTextExplainer, tabular explainers need a training set. This is because we compute statistics on each feature (column). If the feature is numerical, we compute the mean and std, and discretize it into quartiles. If the feature is categorical, we compute the frequency of each value. For this tutorial, we'll only look at numerical features.

We use these computed statistics for two things:

  1. To scale the data, so that we can meaningfully compute distances when the attributes are not on the same scale
  2. To sample perturbed instances - which we do by sampling from a Normal(0,1), multiplying by the std and adding back the mean (as sketched below).
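
Here is a minimal sketch of that perturbation step, for intuition only; Lime's internal sampling also handles discretization and categorical features:

# Illustrative sketch of Lime's perturbation for numerical features:
# draw from Normal(0, 1), scale by each feature's std, add back the mean
num_samples = 5
means = train_numpy_array.mean(axis=0)
stds = train_numpy_array.std(axis=0)
noise = np.random.normal(0, 1, size=(num_samples, train_numpy_array.shape[1]))
perturbed_samples = noise * stds + means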
In [9]:
#explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=iris.feature_names, class_names=iris.target_names, discretize_continuous=True)

explainer = lime.lime_tabular.LimeTabularExplainer(train_numpy_array,
                                                   feature_names=feature_names,
                                                   class_names=iris.target_names,
                                                   discretize_continuous=True)

Create the predictor wrapper instance for the h2o drf

We have a trained H2O distributed random forest that we would like to explain. We need to create a wrapper instance that will make it behave like a scikit-learn random forest for our Lime explainer.

In [10]:
h2o_drf_wrapper = h2o_predict_proba_wrapper(iris_drf, feature_names)
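
As an optional sanity check, the wrapper should return a 2-D array with one probability per class, which is the shape Lime expects from predict_proba:

# Optional check: one row in, a (1, 3) array of class probabilities out
probs = h2o_drf_wrapper.predict_proba(test_numpy_array[0])
print(probs.shape)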

Explaining an instance

Since this is a multi-class classification problem, we set the top_labels parameter, so that we only explain the top class.

In [11]:
# i = np.random.randint(0, test_numpy_array.shape[0])

i = 27
# exp = explainer.explain_instance(test[i], rf.predict_proba, num_features=2, top_labels=1)

exp = explainer.explain_instance(test_numpy_array[i], h2o_drf_wrapper.predict_proba, num_features=2, top_labels=1)
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%

We now explain a single instance:

In [12]:
exp.show_in_notebook(show_table=True, show_all=False)
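
Outside of a notebook, the same explanation can be read programmatically as (feature, weight) pairs; a short sketch using lime's Explanation API:

# Inspect the explanation for the top predicted label as (feature, weight) pairs
top_label = exp.available_labels()[0]
print(exp.as_list(label=top_label))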