Explainable AI
The role of feature importance
Wednesday, Jul 3, 2024 • 17 minutes • Melle van der Heijden
In this article, we’ll walk you through some key concepts in explainable AI using a narrative example. We’ll apply two of the most commonly used methods for interpretability, LIME and SHAP. Following these links, you'll find more information about the popular Python packages we used for these methods.
When you’re done reading this article, you will:
- Know how you can use these methods to provide meaningful insights into the decision strategy of your models.
- Be able to use the differences in feature importance properties to your advantage and avoid some common mistakes.
- Understand what considerations are important to make when using any other technique that computes feature importance.
70k Job Applicants
To help shape ideas into something more concrete, we’ll consider the scenario of a recruiter using applicant data to evaluate candidates. The “Employability Classification of Over 70,000 Job Applicants” dataset contains information about applicants in the IT sector and can be found here on Kaggle. It consists of generic features like the applicant’s age, gender and nationality, as well as features about the applicant’s experience and specific skills. The recruiter feels that by training a machine learning model to predict whether applicants in this dataset get hired, he will be able to find suitable candidates more efficiently and gain insight into the factors that play an important role in the hiring process.
To generate our examples we often choose clarity over precision. With this in mind, we kept the number of features after preprocessing to a minimum. As our model we used a Random Forest classifier, which highlights the difference between SHAP and LIME quite well due to its nonlinearity. It's also computationally quite light, making it easy to pick up our notebook and experiment for yourself.
Feature importance techniques
You may already be familiar with LIME and SHAP. These are currently among the most popular techniques to explain model predictions. Both of these methods are model-agnostic, which means they can be applied to any type of model. LIME and SHAP are used to compute feature importance values, which quantify how ‘important’ a particular feature is for the predictions of a model. While the importance of a feature seems to make intuitive sense at first glance, the exact meaning of importance is not well defined in this context. In the paper Characterizing Data Scientists’ Mental Models of Local Feature Importance, researchers found that expert expectations of feature importance techniques do not always align with the methods used. As the complexity of ML models increases, we rely more and more on techniques like LIME and SHAP to dissect their strategies. Due to ambiguity in what constitutes 'importance' in different contexts, there is a risk of getting a false sense of security. Figure 1 shows the top-10 most important features for both LIME and SHAP. It clearly demonstrates the variety found in these two methods. The label value refers to the ground truth target variable: it's valued 1 in case the individual was hired, 0 otherwise. It's this value our model is attempting to predict.
Local and global feature importance
Models can be explained from two complementary perspectives. Local feature importance pertains to the prediction on a single instance (row) of your dataset. One instance includes all the inputs used to make a single prediction. Global feature importance pertains to the importance for the predictions on an entire dataset. While global feature importance gives you an initial understanding of the model, it is highly simplified and will likely obscure a lot of details. In figure 2 we expanded on the previous example by adding a second instance for both LIME & SHAP.
As you can see, feature importance values can vary wildly between instances and techniques. To understand how this can happen, it is important to first get a basic understanding of how these techniques work. We made sure to include plenty of technical sources for you to come back to later, but let’s stick with the rundown for now.
LIME
LIME is a very popular technique introduced in the paper “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. This is a technique that approximates the gradient of a model. It does so by training a simpler surrogate model on transfer data: a set of data points labelled by the reference model you’re attempting to interpret. There are two ways to obtain a surrogate model. The first is to globally mimic the reference model with an inherently simple surrogate model. However, due to this simplicity, the resulting surrogate often cannot faithfully represent the reference model, which leads to inaccurate or incorrect explanations. Another approach is to consider only a small part of the complex reference model, and locally mimic that portion. Such surrogate models remain locally faithful to the reference model, while also being simple enough to understand.
LIME applies this approach by targeting only the part of the model that is relevant for a particular prediction. First, transfer data is sampled from a constrained region around the instance to be explained. These samples are weighted by a distance kernel, so that samples closer to the instance weigh more heavily. This step makes transfer data more relevant when it closely resembles the instance to explain. A simple model is then trained on the weighted transfer data. If a linear regression surrogate model is used, the coefficients of that surrogate model will approximate the derivative of the reference model. Common variations of LIME (LEMON, ORANGE) differ in their methods of sampling and weighting the transfer data. Conceptually, this comes down to a narrower definition of what constitutes ‘local’.
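The sample, weight and fit steps can be sketched in a few lines of plain Python. This is a from-scratch illustration, not the API of the lime package: the names `lime_sketch` and `solve_linear_system`, the Gaussian sampling, the exponential kernel and all default parameter values are assumptions made for this example.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def lime_sketch(model, instance, n_samples=500, sample_std=0.1,
                kernel_width=0.25, seed=0):
    """Fit a weighted linear surrogate around `instance` (hypothetical helper)."""
    rng = random.Random(seed)
    k = len(instance)
    # 1. Transfer data: perturb the instance and let the reference model label it.
    samples = [[v + rng.gauss(0.0, sample_std) for v in instance]
               for _ in range(n_samples)]
    labels = [model(s) for s in samples]
    # 2. Distance kernel: samples close to the instance get a larger weight.
    weights = [math.exp(-math.dist(s, instance) ** 2 / kernel_width ** 2)
               for s in samples]
    # 3. Weighted least squares with an intercept: (X^T W X) beta = X^T W y.
    rows = [[1.0] + s for s in samples]
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(k + 1)] for i in range(k + 1)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, labels))
            for i in range(k + 1)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]  # intercept, per-feature coefficients


def reference_model(x):
    # Toy nonlinear reference model; its true gradient at [1, 0] is [2, 0].
    return x[0] ** 2


intercept, coefficients = lime_sketch(reference_model, [1.0, 0.0])
print(coefficients)  # close to the true gradient at this instance
```

The coefficients play the role of LIME's feature importance values. Increasing `kernel_width` or `sample_std` widens what counts as 'local', which is the knob the variations mentioned above adjust in different ways.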
SHAP
Shapley value based approaches (like SHAP) find their basis in game theory and pose the distribution of feature importance as a cooperative game, where each feature is a player. In order to capture the influence of interactions between features, SHAP considers how the model prediction changes for each subset, or ‘coalition’, in the power set of features (a set including all subsets, including the empty set and the original set itself). The importance value of feature \(A\) is the aggregate of all prediction changes when sampling different subsets with and without feature \(A\). To wrap our minds around this idea, let's imagine the robots depicted in figure 4 are playing a game. We need to determine which robot has the AI best suited for this game. Everything is digital, so we do have the benefit of being able to simulate this game any number of times we want, using different coalitions of robots.
In the top left corner we find none of the robots actively playing. You can imagine all robots in this scenario got replaced by an actor making a random move on each turn. The accumulated gains if none of the robots are playing are zero. When all robots are playing, however (as depicted in the bottom right), they collectively acquire a nice stack of dollars. If we want to calculate the average contribution of the red player, we compare the accumulated gains of every coalition that includes the red player with the gains of the same coalition without the red player, and take a weighted average of these differences. The resulting value is the Shapley value for the red player.
For ML models, we consider each feature to be a player. To get the model’s prediction without a specific feature, a replacement value is randomly drawn from your dataset. The resulting Shapley value can be interpreted as the average contribution of a feature to the model prediction in different coalitions of features, measured from the mean prediction of the mode of your dataset, called the base rate. The mode of your dataset is an instance constructed by combining all feature values that are most common. SHAP, introduced in the paper A Unified Approach to Interpreting Model Predictions, combines a set of algorithms to approximate Shapley values. For this article we used the model-agnostic KernelSHAP. Other algorithms provide computationally efficient ways to approximate the Shapley values for different families of models.
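For a handful of features, Shapley values can be computed exactly by enumerating every coalition. The sketch below is not KernelSHAP and not the shap package's API: `shapley_values` is a hypothetical helper, and instead of drawing replacement values at random it fills in 'absent' features from one fixed reference instance (standing in for the mode), which is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, reference):
    """Exact Shapley values by enumerating all coalitions (feasible for few features)."""
    n = len(instance)

    def value(coalition):
        # Features in the coalition keep the instance's value,
        # all other features are replaced by the reference value.
        mixed = [instance[i] if i in coalition else reference[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def model(x):
    # Toy model: one independent feature plus an interaction between two others.
    return 2 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(model, instance, reference)
print(phi)  # approximately [2, 3, 3]: the interaction term is split evenly
print(model(reference) + sum(phi), model(instance))  # the two numbers agree
```

The number of coalitions doubles with every added feature, which is why practical implementations approximate these values instead of enumerating them.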
What do feature importances mean?
Additivity & proportionality: theory of relativity
Imagine you found a tick. Upon seeing the doctor, he hands you a diagnostic report for Lyme disease. The report comprises the explanation of an algorithm used as a diagnostic aid. At the top of the report you find some value labeled "prediction", but you have no idea how to interpret it. To indicate how your features combine to result in the algorithm's prediction, it includes a bar chart of the feature importance values. In the bar chart, a lot of feature importances seem very high. This looks very bad!
If SHAP is used, this would be a correct assumption, since Shapley values are additive: added to the base rate, their sum approximates the prediction of the reference model. This means the resulting values are proportional to the model’s prediction, a characteristic which makes Shapley values very popular. Knowing exactly how many dollars a single feature contributes to the prediction of a house price can make an explanation very concrete and relatable. The SHAP package uses a waterfall plot in an attempt to highlight this special property, shown in figure 6.
This obvious upside comes at a cost. When decomposing your model’s prediction into individual feature contributions, the additive property implies importance can always be distributed between individual features. This is usually not the case. We can demonstrate this using a hypothetical example: we train a decision tree classifier which uses the features ‘length’ and ‘weight’ to predict malnourishment. To do this, it would form conditions like ‘if weight below x and length above y’, evaluating a pattern between features instead of evaluating each feature separately. In scenarios like this, it's unclear how to ideally distribute feature importance between features.
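The malnourishment example can be made concrete with two features, for which the Shapley value has a short closed form. Everything in this sketch is hypothetical: the thresholds, the instance, the reference instance and the function names are chosen purely for illustration.

```python
def flags_malnourishment(weight_kg, length_cm):
    # Hypothetical decision rule: both conditions must hold at the same time.
    return 1.0 if weight_kg < 50 and length_cm > 180 else 0.0


instance = {"weight": 45, "length": 190}   # flagged by the rule
reference = {"weight": 75, "length": 175}  # not flagged by the rule


def v(use_weight, use_length):
    """Prediction when only the selected features take the instance's values."""
    weight = instance["weight"] if use_weight else reference["weight"]
    length = instance["length"] if use_length else reference["length"]
    return flags_malnourishment(weight, length)


# With two players, the Shapley value is the average of two marginal
# contributions: joining the empty coalition and joining the other player.
phi_weight = 0.5 * ((v(True, False) - v(False, False)) + (v(True, True) - v(False, True)))
phi_length = 0.5 * ((v(False, True) - v(False, False)) + (v(True, True) - v(True, False)))

print(phi_weight, phi_length)  # 0.5 0.5
```

Neither feature changes the prediction on its own, yet each receives half of the credit. The even split is a convention that follows from the additive property, not something the model itself expresses.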
An example of this in our job applicants dataset is the pair of features 'YearsCode' and 'YearsCodePro'. YearsCode encapsulates the total time the applicant has coded: the years he or she has coded professionally (YearsCodePro) as well as the years he or she has coded as an amateur. LIME can accurately approximate the actual gradient of each of these features separately. In the proportional context of Shapley values it's unclear what definition should apply. It's very common for features used by ML models to be correlated in this manner, and examples are not always as intuitive as these.
LIME approximates the reference model's derivative. The coefficients of the surrogate model represent the rate of change in the reference model's prediction. A large feature importance value for LIME implies that slightly changing the value of the feature will drastically change the model’s prediction. This is especially true when the reference model is least certain, right at the decision boundary. This key difference between gradient-based techniques like LIME and ablation-based techniques like SHAP can be visualized with the logistic curve, seen in figure 7. For this function, you can clearly see why the gradient is often largest closer to the mode (dotted line). Ablation (like Shapley values), in contrast, is largest further away from the mode.
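The contrast between the two views is easy to reproduce numerically on a logistic curve. In this sketch the reference point is placed at the midpoint of the curve, which is an assumption made for the illustration; the function names are ours as well.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(z):
    # Derivative of the logistic curve: the gradient-based (LIME-like) view.
    s = sigmoid(z)
    return s * (1.0 - s)


def ablation(z, reference=0.0):
    # Change in output when z is swapped for the reference value:
    # the ablation-based (Shapley-like) view.
    return sigmoid(z) - sigmoid(reference)


for z in (0.0, 1.0, 3.0, 6.0):
    print(f"z={z:>3}: gradient={gradient(z):.3f}  ablation={ablation(z):.3f}")
```

Moving away from the reference point, the gradient shrinks towards zero while the ablation value keeps growing towards its maximum of 0.5.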
The additive and proportional nature of Shapley values makes them informative as a standalone metric. You can easily reconstruct the model's prediction using basic arithmetic: adding the Shapley values to the base rate. Shapley values tell us how the different features combine to make the model's prediction in a cumulative sense, with individual parts combining to make a whole. Since LIME importances approximate the derivative of the model at a certain local area in feature space, we also need the feature value for LIME to be truly informative. Combining the derivative with the feature value, LIME informs us how the reference model behaves locally.
How do we know we can trust the explanation?
Selectivity: show, hide or highlight information
Research from social sciences shows that people do not expect explanations to provide a complete account of all causes for an event. Instead, people select a subset of causes for the explanation they believe to be the most important. A feature importance method can be selective by limiting the explanation to the most important features. For example, the standard implementation of LIME, by default, reduces the number of features in an explanation with feature selection using Lasso (linear regression with L1 regularization). In contrast, SHAP explanations will include all features in the final explanation, although not all of them will be explicitly included in the plot. In figure 8 you can find 2 extremes explaining the same instance for the same model. One is accurate but overloads our brain, the other very simple, but how insightful is it?
Faithfulness: reflecting model policies
Explaining complex machine learning models means simplifying their decision strategy to the extent the interpreter can understand it. The desired simplicity is in tension with the complexity of your model. The best explanation is the one that strikes a balance between simplicity for the stakeholder and the complexity of the model you are explaining. Oversimplification will result in an explanation that is no longer faithful to the reference model. Since we will use explainable AI explanations to justify ML model behaviours, it's important that we can verify whether the explanation is an accurate reflection of the model. We can do this by using the explanation itself as a model to generate new predictions. In this way, the explanation can be quantitatively compared to the reference model, producing a faithfulness score.
Faithfulness is sometimes called fidelity. There are different ways to estimate it for different techniques. Let’s start with LIME.
LIME uses a surrogate model to mimic our reference model locally and produce an approximation of its gradient. The faithfulness score is determined by how well this surrogate model fits the reference model in the local region defined by the distance kernel. The LIME package will return the surrogate’s \(R^2\) score on all transfer data, using the weights defined by its distance kernel. It's important to note that to score the surrogate model the LIME package will use the same transfer data used to train the surrogate model, without applying cross-validation. This faithfulness evaluation will vary in different regions of your data, which should be seen as a reflection of what this technique aims to achieve. Imagine a linear surrogate model is used. When the surrogate is trained with plenty of transfer data, the best linear fit on the transfer data will be found. In this case, having a low faithfulness score can only mean the transfer data the linear model is trying to fit is still highly nonlinear. This may indicate the kernel width is too large, making the ‘local region’ very big. The reference model’s decision function could also be very erratic in this region, making a proper linear fit impossible.
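A weighted \(R^2\) of this kind is straightforward to compute by hand. The sketch below uses the standard weighted definition; it is not taken from the LIME package's source. As a simplification, the surrogate is a fixed tangent line rather than a refitted model, and the curve, sample points and kernel widths are assumptions chosen for illustration.

```python
import math


def weighted_r2(reference_preds, surrogate_preds, weights):
    """Weighted coefficient of determination of the surrogate against the reference."""
    total_weight = sum(weights)
    mean = sum(w * y for w, y in zip(weights, reference_preds)) / total_weight
    ss_res = sum(w * (y - p) ** 2
                 for w, y, p in zip(weights, reference_preds, surrogate_preds))
    ss_tot = sum(w * (y - mean) ** 2 for w, y in zip(weights, reference_preds))
    return 1.0 - ss_res / ss_tot


def kernel_weights(xs, x0, width):
    # Exponential distance kernel centred on the instance x0.
    return [math.exp(-((x - x0) ** 2) / width ** 2) for x in xs]


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
reference_preds = [x ** 2 for x in xs]      # nonlinear reference model
surrogate_preds = [2 * x - 1 for x in xs]   # tangent line at the instance x0 = 1

narrow = weighted_r2(reference_preds, surrogate_preds, kernel_weights(xs, 1.0, 0.5))
wide = weighted_r2(reference_preds, surrogate_preds, kernel_weights(xs, 1.0, 3.0))
print(narrow, wide)  # the narrow kernel yields the higher faithfulness score
```

With the wide kernel, distant samples where the line no longer follows the curve carry real weight, so the same linear explanation scores far worse.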
The circle in figure 9 represents the kernel-width of LEMON. LEMON is a variation on LIME which only samples transfer data from within the boundaries of this circle. The dividing line represents the reference model's gradient approximated by LEMON. As you can see in this particular situation, while the explainer is faithful for small kernel widths, as it gets larger it becomes less and less faithful (the line is less aligned with the background gradient).
Shapley values are additive by definition, which means SHAP always finds an explanation that properly fits the prediction of the reference model. In the previous section, we learned that SHAP’s additive property leaves ambiguity in the definition of feature importance when concerning correlated features. In this light, it's not clear whether the near-perfect predictive score of the SHAP explainer can be regarded as the algorithm always being faithful to the reference model.
What can’t we learn from our models?
Actionability: influencing model predictions
Let’s say you have a mansion you’re trying to sell (it's ok to dream a little). Your mansion has 10 bathrooms, 1 for each bedroom. Your broker uses some complicated model to estimate a good selling price, and shows you the Shapley values computed for your house. You conclude all bathrooms combined contribute around $250,000 to the sales price.
You decide to do some remodelling before putting your mansion on the market. If you stick to a $10,000 budget per bathroom, you stand to gain $15,000 for every extra bathroom you build! So you put in the effort and double the number of bathrooms. Every bedroom is now enclosed in bathrooms. Satisfied after a hard day of manual labour, you return to your broker. He runs the numbers again. Surprise! The predicted house price didn’t budge a single dollar. The Shapley value indicating the contribution of the (now 20) bathrooms to the sales price is still sitting at $250,000.
Shapley values may be a great way to determine a feature’s contribution to the total sales price, but this doesn’t mean they’re actionable in the sense that influencing the feature will change the price.
Since LIME approximates the reference model’s gradient, the feature importance values it computes are inherently actionable with respect to the model prediction. You can multiply the LIME feature importance by a small change in the feature’s value to get a good approximation of the change in the reference model’s prediction. Going back to our hypothetical example, you might imagine the model’s gradient with respect to the number of bathrooms to be zero, probably even lower if this were a real example. This clearly demonstrates that LIME importances and Shapley values are very different things.
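The mansion story can be replayed with a toy pricing model. The model below is entirely hypothetical: it assumes a bathroom only adds value while it serves a bedroom. Only the number of bathrooms differs from the reference house, in which case the Shapley value of that feature reduces to a plain ablation difference.

```python
def predicted_price(bathrooms, bedrooms=10):
    # Hypothetical model: each bathroom adds $25,000, capped at one per bedroom.
    return 1_000_000 + 25_000 * min(bathrooms, bedrooms)


def ablation_contribution(bathrooms, reference_bathrooms=0):
    # Shapley-style view: prediction with the bathrooms minus prediction without them.
    return predicted_price(bathrooms) - predicted_price(reference_bathrooms)


def local_slope(bathrooms, step=1):
    # Gradient-style view: price change per extra bathroom, measured at this house.
    return (predicted_price(bathrooms + step) - predicted_price(bathrooms)) / step


print(ablation_contribution(10))  # 250000
print(local_slope(10))            # 0.0
print(ablation_contribution(20))  # 250000
```

The contribution answers the question of how much of the price is due to the bathrooms, while the slope answers the question of what happens if you add one more. Only the second question is about taking action.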
In figure 10 above we can see how both LIME and SHAP handle a situation where the value of the "gender" feature was changed. Notice how LIME feature importances stay similar relative to each other but almost double in size. This highlights a local case where experience is the determining factor for our model's decision, but male experience is worth more than female experience.
Correlation: not causation
Which brings us to causation. The way in which switching gender is or isn’t ‘actionable’ jumps out with this example. Our machine learning models are unable to find causal relationships. They take in the features we measured and strictly find correlations between them. Even if our models could tell the difference, it's not very common that we are able to directly measure the thing we care about. Take income for example, which doesn't have a consistent definition. It may include or exclude income from different sources. Different institutions each have their own definition. We don’t measure income directly as it's pouring in. Instead, we might use the income reported in tax filings and pretend it's the same thing. Be very careful when drawing conclusions from our model’s decision strategies. Make sure you understand how features are defined, what is being measured and how this relates to the thing you wish to learn about. Never assume the relationships our models find are causal, even when they ‘make sense’.
To make interpretation even more difficult, our models (when regularized) tend to use the most ambiguous features. Features which correlate well with numerous other features are called confounding features. They are very useful for our models because they are able to capture general information about the numerous other features they correlate with. As we showed earlier, they stir up trouble when we try to dissect what information is important to our model. Not only are they an ambiguous conglomeration of different things, these highly confounding features have the habit of being lower in the chain of influence. They are often on the receiving end of multiple other features. An excellent example would be any social or economic indicator. Statistics like these are often some normalized aggregate of numerous conditions. Very concise single features appear as if speaking volumes, which is why we humans often fall for the same trap as our machine counterparts. In our examples we can clearly see this effect with our ‘ComputerSkills’ feature, which is often regarded as most important by both techniques.
Variations in explanations
Stability: sensitivity to explainer parameters
With every explanation technique there will be certain implementation choices to make. For example, both LIME and SHAP rely on sampling of data instances. There are many different ways of doing this and choosing a different strategy will naturally yield different results. LIME also comes with the choice of a surrogate model and the model parameters to go with it.
For complex models, feature importance values will always constitute an approximation of the model’s underlying prediction-generating mechanism. Consequently, there are usually various alternative, equally valid, explanations for the same prediction. However, feature importance is typically presented as a single value per feature, which may disguise the inherent uncertainties in how the values are derived. Although we can use the flexibility of different methods and parameters to investigate certain aspects of our model, the stakeholder that’s impacted by our model’s decisions is right to distrust an explanation which varies heavily with implementation specifics. An ideal explanation would not change if the model did not change. We refer to the variance in feature importance values when using different explanation parameters as the stability of the explanation.
The examples shown in figure 11 are the result of a dramatically small sample size (5 samples) for SHAP, and doubling the kernel width for LIME.
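The effect of a very small sample size can be reproduced with a sampling-based Shapley estimate. Note that this sketch uses plain permutation sampling, not the KernelSHAP algorithm used for the figures, and that the toy model, the fixed reference instance, the seeds and the function names are all assumptions made for illustration.

```python
import random


def sampled_shapley(model, instance, reference, feature, n_permutations, seed):
    """Monte Carlo Shapley estimate for one feature using random feature orderings."""
    rng = random.Random(seed)
    n = len(instance)
    total = 0.0
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Features that come before `feature` in this ordering form the coalition.
        before = set(order[:order.index(feature)])
        without = [instance[j] if j in before else reference[j] for j in range(n)]
        with_feature = list(without)
        with_feature[feature] = instance[feature]
        total += model(with_feature) - model(without)
    return total / n_permutations


def model(x):
    # Toy model with an interaction, so the marginal contribution of
    # feature 0 depends on which coalition it joins.
    return x[0] * x[1] + x[2]


instance, reference = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
seeds = range(20)
few = [sampled_shapley(model, instance, reference, 0, 5, s) for s in seeds]
many = [sampled_shapley(model, instance, reference, 0, 2000, s) for s in seeds]

print("spread with 5 samples:   ", max(few) - min(few))
print("spread with 2000 samples:", max(many) - min(many))
```

The exact Shapley value of feature 0 in this toy setup is 0.5. With only 5 samples per run the estimate moves in steps of 0.2 from seed to seed, while the 2000-sample runs stay close together.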
Robustness: sensitivity to model inputs
Where stability is the sensitivity of an explanation to varying explainer parameters, robustness is the sensitivity to slightly changing feature values.
The two instances in figures 12 & 13 are quite similar in most regards. If we take a look at the Shapley values for these instances we can see they are, like the instances, somewhat similar except for the 'ComputerSkills' feature. Looking at our model’s predicted score, we are right on the decision boundary. Our model doesn’t quite know how to properly classify these instances but, thankfully, our SHAP explainer remains quite stable and paints a clear picture, repeating a similar narrative for both instances.
LIME also holds up quite well, with only a few subtle changes, of which the one for 'ComputerSkills' is the most important. As the value of 'ComputerSkills' increased by \(\small{3}\), the model's gradient with respect to it also increased by \(\small{.27}\). For LIME, these effects are multiplied.
In Characterizing Data Scientists' Mental Models of Local Feature Importance, researchers found data scientists prefer explanations which are very robust. It's important to keep in mind that LIME produces approximations of the model’s derivative which are locally accurate. If the results LIME produces are very erratic between similar instances, this could be evidence of an erratic decision strategy employed by our model in this region. When interpreting a classifier model this could be close to the decision boundary, where tiny perturbations to feature values could result in a different class prediction. If our model’s prediction isn’t robust, variations in the explanations given by LIME may be just the right tool to investigate why.
We can help
When explaining black box models, there is no silver bullet, and building trust in our models can be tedious. A combination of techniques has to be employed, spanning different scopes of the model. Getting acquainted with the decision strategy is a step-by-step process. First, you collect general information about your model and use it to develop a simple hypothesis. This hypothesis can be tested by employing a range of explainable AI techniques. Your findings will require you to shift and nuance your hypothesis and continue digging.
To help speed up this process, Xaiva comes packed with all the necessary tools implemented in ways you’ve yet to think of. Xaiva Analyze precomputes model statistics using your training and validation data, allowing you to quickly identify meaningful groups of instances and global trends. Using the Xaiva Explain interface you can easily search regions of feature space to see how your model behaves locally, adapting hyperparameters of the explainer on the fly. Looking "around" an instance you're trying to explain is easy with the interactive visualizations. Use this to see how your model's behavior changes in the immediate vicinity of an instance. The platform’s UI can easily be adjusted to fit the specific needs & experience of users. Experts can make use of the Python API, providing full power & flexibility.