I really struggled to have the sklearn, XGBoost and CatBoost APIs function smoothly with the PDPBox, eli5, lime and shap packages. Some just accept pd.DataFrame but then the explanation part is compatible with np.array only, or the other way around. Also, not all classifiers output probabilities by default: XGBoost produces a raw score which needs to be fed into a sigmoid function to convert it into a probability. Nothing dramatic per se. It just means some extra care must be taken when turning to those methods.
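Just for reference, the score-to-probability conversion mentioned above is the logistic sigmoid; a minimal sketch (not tied to any specific library call):
import numpy as np

def sigmoid(raw_score):
    # maps a raw margin / log-odds score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-raw_score))

sigmoid(0.0)  # -> 0.5, i.e. a zero margin corresponds to a 50% probability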
For the time being, I managed to get both sklearn and XGBoost to work with all the relevant explanation-related packages, with the caveat that PDPBox only accepts models trained on pd.DataFrame. For CatBoost, instead, I could not get a lime pipeline to work, even though shap does just fine.
Molnar, Christoph. "Interpretable Machine Learning. A Guide for Making Black Box Models Explainable", 2019. https://christophm.github.io/interpretable-ml-book/
%load_ext autoreload
%autoreload 2
%matplotlib inline
from interpretable_ml_utils import *
shap.initjs()
path = Path("C:/Users/francesco.pochetti/Notebooks/data/Varied")
fname = path/"adult.feather"
if not os.path.isfile(fname):
    cols = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
            'marital_status', 'occupation', 'relationship', 'race', 'sex',
            'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
    df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                     names=cols)
    df.to_feather(fname)
else:
    df = pd.read_feather(fname)
df.shape
df.head()
missing_values(df)
features = ['age', 'workclass', 'education', 'marital_status', 'occupation',
'relationship', 'race', 'sex', 'hours_per_week', 'native_country',
'capital_gain', 'capital_loss']
for c in features+['income']:
    if not is_numeric_dtype(df[c]):
        df[c] = df[c].str.strip()
        df = reduce_cardinality(df, c)
raw_train, raw_valid = train_test_split(df[features+['income']], train_size=0.8, random_state=1234)
raw_train.shape, raw_valid.shape
display_all(raw_train.head())
cats = train_cats(raw_train)
X_train, y_train, nas = proc_df(raw_train, y_fld='income', max_n_cat=15)
apply_cats(raw_valid, raw_train)
X_valid, y_valid, _ = proc_df(raw_valid, y_fld='income', na_dict=nas, max_n_cat=15)
feats = list(X_train.columns)
raw_train.shape, raw_valid.shape, X_train.shape, X_valid.shape
cats['workclass']
clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=300, n_jobs=-1)
clf.fit(X_train, y_train)
_ = print_report(clf, X_valid, y_valid, t=0.5, cal=False, roc=False,
conf=False, classes=cats['income'])
#clf = RandomForestClassifier(n_estimators=150, min_samples_leaf=5, n_jobs=-1)
#clf.fit(X_train, y_train)
#_ = print_report(clf, X_valid, y_valid, t=0.5, cal=False, roc=False,
# conf=False, classes=cats['income'])
Pros and cons of this approach, quoting Molnar:
- Nice interpretation: Feature importance is the increase in model error when the feature’s information is destroyed.
- The importance measure automatically takes into account all interactions with other features.
- It is very unclear whether you should use training or test data to compute the feature importance.
- If features are correlated, the permutation feature importance can be biased by unrealistic data instances.
- Adding a correlated feature can decrease the importance of the associated feature by splitting the importance between both features.
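To make the shuffle-and-score idea concrete, here is a minimal sketch of what permutation importance computes (the get_permutation_imp call below comes from my utils module and is not necessarily implemented exactly this way):
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importance_sketch(model, X, y, n_repeats=3, seed=0):
    rng = np.random.RandomState(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    importances = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X_perm[col].values)  # destroy the feature's information
            drops.append(baseline - roc_auc_score(y, model.predict_proba(X_perm)[:, 1]))
        importances[col] = np.mean(drops)  # average drop in AUC = importance of the feature
    return importances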
try: imp = get_permutation_imp(clf, X_train, y_train, feats=feats)
except Exception: imp = get_permutation_imp(clf, X_train.values, y_train, feats=feats)
Methodology currently implemented in scikit-learn. Source: SO.
It is sometimes called "gini importance" or "mean decrease impurity" and it is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node), averaged over all trees of the ensemble.
Compared to permutation-based feature importance, this method comes at no additional cost (the gains are directly calculated as part of training) and does not suffer from the drawback of potentially correlated features, as data points are not shuffled.
Probably the biggest disadvantage is that the absolute importances are less interpretable, compared to the more concise and transparent increase in error rate we had before.
Note that, as expected, the two illustrated feature importance methods yield slightly different results in terms of the ordering of the top variables.
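For reference, these impurity/gain-based importances are exposed directly by the fitted estimator (both sklearn ensembles and the XGBoost sklearn wrapper have a feature_importances_ attribute); get_styled_feat_importance below is my own wrapper, so a minimal equivalent sketch looks like:
import pandas as pd

imp_df = (pd.DataFrame({'feature': feats, 'importance': clf.feature_importances_})
            .sort_values('importance', ascending=False))
imp_df.head(15)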
get_styled_feat_importance(clf, feats, 15)
The idea behind PDP is to check the relationship between the dependent variable and a feature of interest, in a way that isolates the dependence of the output on that input, all other things being equal.
This is achieved in the following way:
1. Pick the feature of interest x and define a grid of N values spanning its range.
2. For each value in the grid: (A) replace x with that value for every data point, leaving all other features untouched; (B) run the model on this modified dataset and collect the predictions.
Say that your dataset is composed of n data points. Each time you run 2B, you'd get n predictions, i.e. you'd come up with N x n predictions at the end of the procedure.
Now, you have two options:
- Average the n predictions for each grid value and plot the resulting N points: this is the classic PDP.
- Plot all n lines over the N points on the X-axis. This is what ICE plots look like.
The charts below show both all lines and the average at the same time.
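A minimal numpy sketch of this grid-replace-predict procedure (the actual plots below are produced with PDPBox via my plot_pdp wrapper; names here are purely illustrative):
import numpy as np

def pdp_ice_sketch(model, X, feature, n_grid=20):
    grid = np.linspace(X[feature].min(), X[feature].max(), n_grid)
    ice = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                    # force every data point to take this grid value
        ice.append(model.predict_proba(X_mod)[:, 1])
    ice = np.array(ice)                           # shape (N grid values, n data points): the ICE lines
    pdp = ice.mean(axis=1)                        # averaging the ICE lines gives the PDP
    return grid, ice, pdp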
- The computation of partial dependence plots is intuitive: the partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value.
- If the feature for which you computed the PDP is not correlated with the other features, then the PDPs perfectly represent how the feature influences the prediction on average. In the uncorrelated case, the interpretation is clear.
- Unlike partial dependence plots, ICE curves can uncover heterogeneous relationships.
- The realistic maximum number of features in a partial dependence function is two.
- The assumption of independence is the biggest issue with PD plots.
- To meaningfully plot both PDPs and ICE for categorical variables, the variable must be one-hot-encoded. This is a constraint as tree-ensemble algorithms (the de-facto go-to solution in the non-deep-learning world) can easily handle features which are not encoded in this way.
WARNING! As you will notice from the first plot below (capital_gain), the technique does not work for highly skewed features. I could not manage to get PDPBox to reduce the range of the variable in question by cutting the highest quantiles.
plot_pdp(clf, X_valid.sample(2000), feats, 'capital_gain')
plot_pdp(clf, X_valid.sample(2000), feats, 'age', center=False)
plot_pdp(clf, X_valid.sample(2000), feats, 'hours_per_week')
cat = 'marital_status'
f = [c for c in feats if c.startswith(cat)]
if len(f)>1: plot_pdp(clf, X_valid.sample(2000), feats, f, classes=cats[cat], plot_params={'title': f'PDP for feature {cat}'}, center=False)
cat = 'occupation'
f = [c for c in feats if c.startswith(cat)]
if len(f)>1: plot_pdp(clf, X_valid.sample(2000), feats, f, classes=cats[cat], plot_params={'title': f'PDP for feature {cat}'}, center=False)
cats = train_cats(raw_train)
X_train, y_train, nas = proc_df(raw_train, y_fld='income', max_n_cat=None)
apply_cats(raw_valid, raw_train)
X_valid, y_valid, _ = proc_df(raw_valid, y_fld='income', na_dict=nas, max_n_cat=None)
feats = list(X_train.columns)
raw_train.shape, raw_valid.shape, X_train.shape, X_valid.shape
Notice that in the cell below, in the case of XGBoost, we call clf.fit(X_train.values, y_train), i.e. we fit on np.array and not on pd.DataFrame. This is due to the downstream packages (lime, eli5, shap) not always being compatible with pandas when used with XGBoost or CatBoost. The sklearn API is always compatible with pandas, though.
clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=300, n_jobs=-1)
clf.fit(X_train.values, y_train)
_ = print_report(clf, X_valid.values, y_valid, t=0.5, cal=False, roc=False,
conf=False, classes=cats['income'])
#clf = RandomForestClassifier(n_estimators=150, min_samples_leaf=5, n_jobs=-1)
#clf.fit(X_train, y_train)
#_ = print_report(clf, X_valid, y_valid, t=0.5, cal=False, roc=False,
# conf=False, classes=cats['income'])
imp = get_permutation_imp(clf, X_train.values, y_train, feats=feats)
get_styled_feat_importance(clf, feats)
The following approach is not known in the community under a specific name and it does not come with a paper.
Even though not highly advertised, it is incredibly powerful, and got integrated (in an independent implementation) within the eli5 library.
Ando Saabas proposed the method in this series of posts. The idea is that we can decompose the contribution of each feature to a specific prediction by following the path from the root of the tree to the node where the instance of interest resides. This is done by recording, at each split of the feature space, the difference in the proportion of the positive class (in case of classification) between the parent and the child node, and assigning that difference to the variable the split was performed on. If we start from the root, the percentage of the positive class simply represents the average across the entire training set. Let's assume this is 70%. Now assume that the first split happens on feature X_1, that our instance of interest sits on the right branch of the tree, and that within this subset the percentage of the positive class is 80%. Then we can safely state that X_1 drives an increase in probability of 10 percentage points. Imagine doing this until the tree is fully grown, and averaging the result across all trees (in case of a forest).
This is how the method works in a nutshell.
The contributions, by definition, add up to the final model's outcome, making them easy to interpret and to visualize. Both a tabular form, as eli5 proposes, and a more business-oriented waterfall chart, are nice ways of doing so.
This method relies on the metrics which actually get computed at each tree's node. In the case of XGBoost, the algorithm calculates a score which, unlike in RandomForest, cannot directly be interpreted as a probability, so beware of the numbers eli5 displays.
# id = X_valid.sample().index.values[0]
# fixing the ID value for illustration purposes
id = 20150
explain_pred_contrib(id, clf, X_valid, feats, cats=cats, waterfall={'rotation_value':60, 'threshold': None})
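For reference, a hedged sketch of calling eli5 directly instead of my explain_pred_contrib wrapper (remember that for XGBoost the displayed weights live in margin/log-odds space, not in probability space):
import eli5

# decompose the single prediction into a bias term plus per-feature contributions
expl = eli5.explain_prediction(clf, X_valid.loc[id].values, feature_names=feats)
print(eli5.format_as_text(expl))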
The principle behind LIME is based on fitting a local interpretable model (a linear regression, for instance) on the predictions of the black-box model. First thing to notice: LIME does not fit on the actuals, but on the black-box model's outputs. Those are the ones we have to explain, after all.
The recipe for training local surrogate models (a raw lime sketch follows this list):
- Select your instance of interest for which you want to have an explanation of its black box prediction.
- Perturb your dataset and get the black box predictions for these new points.
- Weight the new samples according to their proximity to the instance of interest.
- Train a weighted, interpretable model on the dataset with the variations.
- Explain the prediction by interpreting the local model.
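Below is a minimal sketch of the raw lime API which my get_lime_explainer/explain_pred_lime helpers (used further down) presumably wrap; all parameter choices here are illustrative:
from lime.lime_tabular import LimeTabularExplainer

lime_explainer = LimeTabularExplainer(
    X_train.values,                      # training data used to sample perturbations
    feature_names=feats,
    class_names=list(cats['income']),
    discretize_continuous=True,
    kernel_width=None)                   # default: 0.75 * sqrt(number of columns)
# (in practice categorical columns should also be declared via categorical_features/categorical_names)
lime_exp = lime_explainer.explain_instance(
    X_valid.loc[id].values,              # instance of interest
    clf.predict_proba,                   # black-box prediction function
    num_features=5)                      # length of the explanation
lime_exp.show_in_notebook()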
- Explanations are short (= selective) and possibly contrastive.
- It works for tabular data, text and images.
- The fidelity measure (how well the interpretable model approximates the black box predictions) gives us a good idea of how reliable the interpretable model is.
- The correct definition of the neighborhood is a very big, unsolved problem when using LIME with tabular data. In my opinion it is the biggest problem with LIME and the reason why I would recommend to use LIME only with great care. [...] LIME currently uses an exponential smoothing kernel to define the neighborhood. A smoothing kernel is a function that takes two data instances and returns a proximity measure. The kernel width determines how large the neighborhood is: A small kernel width means that an instance must be very close to influence the local model, a larger kernel width means that instances that are farther away also influence the model. If you look at LIME’s Python implementation (file lime/lime_tabular.py) you will see that it uses an exponential smoothing kernel (on the normalized data) and the kernel width is 0.75 times the square root of the number of columns of the training data.
- Another really big problem is the instability of the explanations. In an article the authors showed that the explanations of two very close points varied greatly in a simulated setting.
Conclusion: Local surrogate models, with LIME as a concrete implementation, are very promising. But the method is still in development phase and many problems need to be solved before it can be safely applied.
WARNING! The explanation below, unfortunately, makes no sense, showing LIME's limitations. I decided to keep it for illustration purposes.
When asked to explain an instance, the LIME library produces three outputs (see below):
- The predicted probabilities of the event (>50K) and the non-event (<=50K) occurring.
- A bar chart with the feature contributions. Taking the capital_gain bar extending to the left of the black vertical line as an example: if capital_gain were not <=0 (it is 0 for this specific data point), the probability of the event >50K would be 0.63 higher. Differently stated, capital_gain<=0 pushes the prediction towards the non-event by 0.63. The second orange bar, instead, can be interpreted as: had marital_status been different from Married-civ-spouse, the probability of the event would be 0.13 lower, so this feature is acting in favour of >50K. Note that this explanation contradicts the model's prediction that the person earns more than 50K with probability 60%. Another resource to look at to understand how to interpret those numbers is the official LIME GitHub repo: the screenshots section provides a couple of interesting examples.
- A table with the actual values of the top features for the instance of interest.
An interesting fact to call out is the significant difference between the LIME local prediction and the black-box model prediction. There is no guarantee those two would be equal, but still, such a big difference should hold us back from trusting the explanations, as they might be inconsistent, as is actually the case here.
explainer = get_lime_explainer(X_train, cats, feats, kernel_width=None)
exp = explain_pred_lime(id, clf, explainer, X_valid, num_features=5)
Fiddling around with kernel_width to try to get better explanations...
If we reduce the kernel_width to 0.5, things improve a lot.
explainer = get_lime_explainer(X_train, cats, feats, kernel_width=0.5)
exp = explain_pred_lime(id, clf, explainer, X_valid, num_features=5)
#if 'XGB' in str(type(clf)): print(measure_lime_accuracy(clf, explainer, X_valid.sample(2000), num_features=5))
Quotes taken from here and here
Goal
The goal of SHAP is to explain the prediction of an instance x by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory. [...] A prediction can be explained by assuming that each feature value of the instance is a "player" in a game where the prediction is the payout. Shapley values [...] tell us how to fairly [i.e. in an additive fashion] distribute the "payout" among the features.
Conditions Shapley values meet
source Consider the following scenario: a group of people are playing a game. As a result of playing this game, they receive a certain reward; how can they divide this reward between themselves in a way which reflects each of their contributions? There are a few things which everyone can agree on; meeting the following conditions will mean the game is fair according to Shapley values:
- [Efficiency] The sum of what everyone receives should equal the total reward
- [Symmetry] If two people contributed the same value, then they should receive the same amount from the reward
- [Dummy] Someone who contributed no value should receive nothing
- [Additivity] If the group plays two games, then an individual’s reward from both games should equal their reward from their first game plus their reward from the second game
Intuition
An intuitive way to understand the Shapley value is the following illustration: The feature values enter a room in random order. All feature values in the room participate in the game (= contribute to the prediction). The Shapley value of a feature value is the average change in the prediction that the coalition already in the room receives when the feature value joins them.
source Shapley values calculate the importance of a feature by comparing what a model predicts with and without the feature. However, since the order in which a model sees features can affect its predictions, this is done in every possible order, so that the features are fairly compared.
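For completeness (this formula is not part of the quotes above, but it is the standard definition the intuition describes), the Shapley value of feature j is the weighted average, over all coalitions S not containing j, of the change in the value function when j joins:

$$\phi_j = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!}\,\big(val(S \cup \{j\}) - val(S)\big)$$

where p is the number of features and val(S) is the prediction obtained using only the features in coalition S (marginalizing the others out).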
Interpretation
The interpretation of the Shapley value ϕ_j for feature value j is: the value of the j-th feature contributed ϕ_j to the prediction of this particular instance compared to the average prediction for the dataset.
Be careful to interpret the Shapley value correctly: the Shapley value is the average contribution of a feature value to the prediction in different coalitions. The Shapley value is NOT the difference in prediction when we would remove the feature from the model.
What does the SHAP package offer?
- Single-instance force plots: each feature value is a force that either increases or decreases the prediction. The prediction starts from the baseline, which for Shapley values is the average of all predictions. In the plot, each Shapley value is an arrow that pushes to increase (positive value) or decrease (negative value) the prediction. These forces balance each other out at the actual prediction of the data instance.
- Aggregated SHAP-based feature importance: sort features based on mean absolute Shapley values.
- Instance-level SHAP-based feature importance: sort features by aggregated SHAP-based importance and show instance-granular Shapley values dispersed along the y-axis.
- Clustering instances based on Shapley values.
- Dependence plots: show the relationship between the Shapley values of feature j and the values of feature j itself, with the additional possibility to colour-code by the intensity of a second variable.
- The difference between the prediction and the average prediction is fairly distributed among the feature values of the instance [i.e. the contributions must add up to that difference] – the Efficiency property of Shapley values. This property distinguishes the Shapley value from other methods such as LIME. LIME does not guarantee that the prediction is fairly distributed among the features.
- The Shapley value is the only explanation method with a solid theory. The axioms – efficiency, symmetry, dummy, additivity – give the explanation a reasonable foundation. Methods like LIME assume linear behavior of the machine learning model locally, but there is no theory as to why this should work.
- The Shapley value requires a lot of computing time.
- The Shapley value can be misinterpreted. The Shapley value of a feature value is not the difference of the predicted value after removing the feature from the model training. The interpretation of the Shapley value is: Given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean prediction is the estimated Shapley value.
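For reference, a hedged sketch of the vanilla shap API which the get_shap_* helpers below presumably wrap (shap.TreeExplainer covers XGBoost, RandomForest and CatBoost models):
# build the explainer from the fitted model and compute one row of SHAP values per instance
raw_explainer = shap.TreeExplainer(clf)
raw_shap_values = raw_explainer.shap_values(X_valid)

# aggregated + instance-level SHAP-based importance
shap.summary_plot(raw_shap_values, X_valid, feature_names=feats)
# dependence plot for a single feature, colour-coded by an automatically chosen second variable
shap.dependence_plot('age', raw_shap_values, X_valid, feature_names=feats)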
shap_explainer = get_shap_explainer('tree', clf)
shap_values = get_shap_values(shap_explainer, X_valid)
The red "forces" in the below plot push the probability of the event high, whereas the "blue" ones work against them pushing in the opposite direction. Starting from the base_value
, all forces add up to the final prediction.
# `link='logit'` only for classifiers such as XGBoost not producing probs directly
show_shap(shap_explainer.expected_value, shap_values, X_valid, id, cats=cats,
link='logit', matplotlib=True, feats=feats, clf=clf)
cat_cols = [c for c in cats.keys() if c!='income']
train_data = Pool(
data=X_train,
label=y_train,
cat_features=cat_cols
)
holdout_data = Pool(
data=X_valid,
label=y_valid,
cat_features=cat_cols
)
clf_c = CatBoostClassifier(
iterations=100,
random_seed=random_state,
learning_rate=0.1,
custom_loss=['AUC']
)
_ = clf_c.fit(train_data, eval_set=holdout_data, plot=True, verbose=False)
#clf_c.fit(X_train, y_train, cat_features=cat_cols, eval_set=(X_valid, y_valid), plot=True, verbose=False)
_ = print_report(clf_c, X_valid.values, y_valid, t=0.5, cal=False,
roc=False, conf=False, classes=cats['income'])
shap_explainer = get_shap_explainer('tree', clf_c)
shap_values = get_shap_values(shap_explainer, holdout_data)
shap_imp = get_shap_feat_importance(shap_values, holdout_data, feats)
get_shap_feat_importance(shap_values, X_valid)
show_shap(shap_explainer.expected_value, shap_values, X_valid, id, cats=cats,
matplotlib=True, link='logit', feats=feats, clf=clf_c)
clf_c.predict_proba(X_valid.loc[id].values)[1]
Checking additivity of Shapley values
assert_shap_proba(clf_c, shap_explainer, shap_values, X_valid, id)
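Under the hood the check boils down to the following (a hedged sketch, assuming get_shap_values returns a 2-D array of per-instance SHAP values in log-odds space and that expected_value is a scalar):
import numpy as np

row = X_valid.index.get_loc(id)                              # positional index of the instance
raw_score = shap_explainer.expected_value + shap_values[row].sum()
manual_proba = 1 / (1 + np.exp(-raw_score))                  # sigmoid: log-odds -> probability
manual_proba, clf_c.predict_proba(X_valid.loc[id].values)[1]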
show_shap(shap_explainer.expected_value, shap_values, X_valid, id=None)
show_shap(shap_explainer.expected_value, shap_values, X_valid, id=None, x="hours_per_week",
y='sex', disp=raw_train)