I've built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix
.
When I went to examine an individual estimator, however, I found difficulty interpreting the binary input features, such as f60150
below. The real-valued f60150
in the bottommost chart is easy to interpret - its criterion is in the expected range of that feature. However, the comparisons being made for the binary features, <X> < -9.53674e-07
doesn't make sense. Each of these features are either 1 or 0. -9.53674e-07
is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underpinning plotting libraries, but it doesn't make sense to use that comparison when the feature is always positive. Can someone help me understand which direction (i.e. yes, missing
vs. no
corresponds to which true/false side of these binary feature nodes?
Here is a reproducible example:
import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt
def booleanize_csr_matrix(mat):
''' Convert sparse matrix with positive integer elements to 1s '''
nnz_inds = mat.nonzero()
keep = np.where(mat.data > 0)[0]
n_keep = len(keep)
result = scipy.sparse.csr_matrix(
(np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
shape=mat.shape
)
return result
### Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target
### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)
# Whether to "booleanize" the input matrix
booleanize = True
# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True
if booleanize and to_int:
X = booleanize_csr_matrix(X)
X = X.astype(np.int64)
# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)
# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
Running the above with booleanize
and to_int
set to True
yields the following chart:
Running the above with booleanize
and to_int
set to False
yields the following chart:
Heck, even if I do a really simple example, I get the "right" results, regardless of whether X
or y
are integer or floating types.
X = np.matrix(
[
[1,0],
[1,0],
[0,1],
[0,1],
[1,1],
[1,0],
[0,0],
[0,0],
[1,1],
[0,1]
]
)
y = np.array([1,0,0,0,1,1,1,0,1,1])
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
from xgboost.plot_tree: binary feature interpretation
No comments:
Post a Comment