TL;DR: Our previous paper showed that interpreting neural networks can be difficult because of issues with the underlying model’s confidence. Here, we fix some of those confidence issues by applying Deep k-Nearest Neighbors to NLP, which improves model interpretations.

Our paper, full code, and supplementary website provide further details.


The growing ubiquity of neural networks in sensitive domains raises questions about how much humans can trust these machine learning systems. A critical issue is test-time interpretability: how can humans understand the reasoning behind neural network predictions?

One common interpretation technique highlights input features based on their importance to the model’s prediction. See, for example, SmoothGrad’s interpretation of a model that predicted “Obelisk”. Image taken from the PAIR site.

Obelisk Photo Obelisk Smoothgrad

For NLP, one simple interpretation method is Leave One Out: individually remove each word from the input and measure the change in model confidence. If a word’s removal considerably changes the output confidence, that word is deemed important to the prediction.
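The Leave One Out procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from our release: `score_fn` stands for whatever returns the model's confidence in the predicted class, and `toy_confidence` and the example sentence are hypothetical stand-ins invented here so that the snippet runs on its own.

```python
from typing import Callable, List, Sequence


def leave_one_out(
    tokens: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> List[float]:
    """Importance of a token = score on the full input minus score without it."""
    base = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        # Remove exactly one token and re-score the shortened input.
        reduced = list(tokens[:i]) + list(tokens[i + 1:])
        importances.append(base - score_fn(reduced))
    return importances


# Hypothetical stand-in for a classifier's confidence in the positive class.
def toy_confidence(tokens: Sequence[str]) -> float:
    return 0.9 if "shines" in tokens else 0.5


tokens = ["the", "actress", "shines"]  # made-up input, not from the dataset
scores = leave_one_out(tokens, toy_confidence)
```

With this toy scorer, only the removal of "shines" changes the score, so it is the only token that receives a non-zero importance.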

Interpretation and Model Limitations

Recent work has identified a number of limitations for saliency-based interpretations. We discussed one particular limitation in a previous paper: the overconfidence of neural models.

In short, a neural network’s confidence can be unreasonably high even when the input is void of any predictive information. Therefore, when removing features with a method like Leave One Out, the change in confidence may not properly reflect whether the “important” input features have been removed. Consequently, interpretation methods that rely on confidence may fail due to issues in the underlying model.

A common failure mode that results from overconfidence is assigning small importance values to many of the input words. This occurs because no matter which word is removed, the confidence of the model remains high, giving each word a small, but non-zero importance value.

Leave One Out Saliency Map

This is illustrated in the sentiment analysis example above, both for Leave One Out using model confidence (Confidence) and for the gradient with respect to the input (Gradient). Leaving out the word “diane” causes a 0.10 drop in the probability for the positive class, whereas removing “shines” causes a smaller confidence drop of 0.09. This does not align with human perception, as “shines” is the critical word for classifying the sentiment as positive. Moreover, “unfaithful” is incorrectly assigned a negative value (the word refers to the title of a film).

Deep k-Nearest Neighbors

Papernot and McDaniel introduced the Deep k-Nearest Neighbors algorithm (henceforth DkNN). The algorithm is a simple modification to the test-time behavior of neural models.

In DkNN, training is conducted normally. Then, before test-time, each example from the training set is passed through the model and the representations at each layer of the network are saved. This creates a new dataset whose features are the network’s representations and the predicted classes are the labels. To make a test prediction, the representations for each of the network’s layers are computed for the test point. Those features are then used for k-nearest neighbors classification. This modification does not degrade the accuracy for image classifiers (see original work) or text classifiers (see our paper).
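The two phases described above (storing per-layer representations, then classifying by nearest neighbors) can be sketched as follows. This is a simplified illustration under several assumptions: the "layers" are hypothetical toy functions rather than a trained network, labels are supplied by the caller, neighbors from all layers are pooled into one majority vote, and distances are Euclidean. The original DkNN paper contains further details that are omitted here.

```python
import math
from collections import Counter


def build_layer_index(layer_fns, train_inputs):
    """Before test time: store every training example's representation at every layer."""
    return [[fn(x) for x in train_inputs] for fn in layer_fns]


def dknn_neighbor_labels(x, layer_fns, layer_index, train_labels, k):
    """Labels of the k nearest training points at each layer, pooled together."""
    pooled = []
    for fn, reps in zip(layer_fns, layer_index):
        query = fn(x)
        # Rank training examples by distance to the test point in this layer's space.
        order = sorted(range(len(reps)), key=lambda i: math.dist(query, reps[i]))
        pooled.extend(train_labels[i] for i in order[:k])
    return pooled


def dknn_predict(x, layer_fns, layer_index, train_labels, k=3):
    """Predict the class that holds the most votes among the pooled neighbors."""
    votes = Counter(dknn_neighbor_labels(x, layer_fns, layer_index, train_labels, k))
    return votes.most_common(1)[0][0]


# Toy "network" with two hypothetical layers (assumptions for illustration only).
layer_fns = [
    lambda x: list(x),
    lambda x: [math.tanh(v) for v in x],
]
train_inputs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
train_labels = ["negative"] * 3 + ["positive"] * 3

layer_index = build_layer_index(layer_fns, train_inputs)
prediction = dknn_predict((2.9, 3.0), layer_fns, layer_index, train_labels)
```

A real implementation would typically replace the brute-force `sorted` call with an approximate nearest neighbor index, since the stored dataset grows with both the training set and the number of layers.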

DkNN Graphic

For our purposes, the main benefit of this decision-making procedure is the robust measure of model uncertainty given by the conformity score. This score is the percentage of nearest neighbors that belong to the predicted class. For example, in the figure above taken from the DkNN paper, 17/18 of the nearest neighbors belong to the Panda class for the non-adversarial input, whereas 12/18 neighbors belong to the School Bus class for the adversarial input. This trend holds generally: on “easy” evaluation examples, the conformity score will be near 1, while on out-of-domain inputs such as adversarial examples, the conformity score will be closer to zero.
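As a worked version of the panda example, the conformity score can be computed directly from the pooled neighbor labels. The sketch below assumes the simple definition given above (fraction of neighbors in the predicted class). The post does not say which classes the remaining neighbors belong to, so the label "other" is a placeholder.

```python
from collections import Counter


def conformity(neighbor_labels):
    """Return (predicted class, fraction of neighbors that share that class)."""
    counts = Counter(neighbor_labels)
    # Ties are broken by first occurrence, which is an arbitrary choice for this sketch.
    predicted, votes = counts.most_common(1)[0]
    return predicted, votes / len(neighbor_labels)


# Neighbor counts from the figure; "other" is a placeholder label.
clean_neighbors = ["panda"] * 17 + ["other"] * 1
adversarial_neighbors = ["school bus"] * 12 + ["other"] * 6

clean_label, clean_score = conformity(clean_neighbors)        # 17/18, about 0.94
adv_label, adv_score = conformity(adversarial_neighbors)      # 12/18, about 0.67
```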

Using the conformity measure, we can generate interpretations by applying a modified version of Leave One Out. After individually removing each word, rather than measuring the resulting drop in confidence, we instead measure the resulting drop in conformity. Intuitively, when we remove the “important” words, we hope the conformity score will drop significantly. Conversely, the network should be invariant to unimportant words, so their removal should have a negligible effect on the learned representation and thus on the conformity score.
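A minimal sketch of Conformity Leave One Out on a toy problem follows. Everything concrete in it is an assumption made for illustration: a single representation layer, the hypothetical `encode` function and `WORD_VECTORS` table, the made-up training points, and the choice to measure the drop as the fraction of neighbors that still carry the originally predicted label.

```python
import math


def label_fraction(rep, train_reps, train_labels, label, k):
    """Fraction of the k nearest training representations that carry `label`."""
    order = sorted(range(len(train_reps)), key=lambda i: math.dist(rep, train_reps[i]))
    return sum(train_labels[i] == label for i in order[:k]) / k


def conformity_leave_one_out(tokens, encode, train_reps, train_labels, label, k=3):
    """Drop in conformity (for the original prediction) when each token is removed."""
    base = label_fraction(encode(tokens), train_reps, train_labels, label, k)
    drops = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        score = label_fraction(encode(reduced), train_reps, train_labels, label, k)
        drops.append(base - score)
    return drops


# Hypothetical encoder: sums toy 2-d word vectors; unknown words map to the origin.
WORD_VECTORS = {"shines": (2.0, 0.0)}


def encode(tokens):
    vecs = [WORD_VECTORS.get(t, (0.0, 0.0)) for t in tokens]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(dim) for dim in zip(*vecs))


# Made-up stored training representations and their labels.
train_reps = [(2.0, 0.0), (2.2, 0.0), (1.8, 0.0), (-1.0, 0.0), (-1.2, 0.0), (-2.0, 0.0)]
train_labels = ["positive"] * 3 + ["negative"] * 3

tokens = ["the", "actress", "shines"]
drops = conformity_leave_one_out(tokens, encode, train_reps, train_labels, "positive")
```

In this toy setup, removing "the" or "actress" leaves the representation unchanged, so conformity does not move. Removing "shines" shifts the representation toward the negative examples, and conformity falls from 1 to 1/3.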

Interpretation Results

We compare our interpretation method (Conformity Leave One Out) to standard Leave One Out (Confidence) and gradient-based interpretations (Gradient) in the table below.

Saliency Comparison

Using conformity for interpretations, rather than confidence, provides a few benefits. First, the change in conformity better separates the importance values, dropping just 0.03 when “diane” is removed, but 0.38 when “shines” is removed. This mitigates the issue where a majority of the input words are assigned a small importance. Notice that the issues with confidence are not simply in the scale of the importance values: as discussed previously, Confidence Leave One Out actually assigns a higher importance score to “diane” than to “shines”.

The second observation is that confidence-based approaches are biased towards selecting word importance based on the inherent sentiment of a word, rather than its meaning in context. For example, see “clash”, “terribly”, and “unfaithful” in the table above. The removal of these words causes a small change in the confidence. When using DkNN, the conformity score indicates that the model’s uncertainty has not risen without these input words, and thus the method does not assign them any importance.

SNLI Artifacts

Our interpretation technique is more precise at identifying the words necessary to make a prediction. Where else can we apply this?

One area we were eager to explore further was the annotation biases identified in the SNLI dataset. Multiple groups independently showed that training models on only a portion of the input could yield relatively high test accuracy. This occurs due to biases introduced in the crowdsourcing process that correlate with certain labels.

We wanted to see if models that use the full input also learn these biases. We create saliency maps using our interpretation method, using the color blue to highlight words that support the model’s predicted class, and the color red to highlight words that support a different class.

SNLI Interpretations

The first example is randomly sampled from the validation set, showing how the words “swims” and “pool” correctly support the model’s prediction of contradiction. The other examples are selected because they contain terms identified as artifacts. In the second example, our interpretation method assigns extremely high word importance to “sleeping”, disregarding other words necessary to predict Contradiction (i.e., the Neutral class is still possible if “pets” is replaced with “people”). In the final two examples, the interpretation method diagnoses the model failure, assigning high importance to “wearing”, rather than focusing positively on the shirt color. Interestingly, words identified by conformity Leave One Out better align with annotation biases compared to standard Leave One Out (see paper for the full details).


Many recent interpretation methods use a variety of techniques to address issues in the underlying model (e.g., local gradient noise, satisfying interpretation axioms) while treating the model as an immutable black-box. In our work, we made minor model changes (without harming accuracy), which led to interesting interpretation results.

In the future, I think having the flexibility to make changes to the training- or test-time behavior of models will yield interpretability benefits. Rather than a black-box view, we can take insights from the latest understanding of neural network decision surfaces and adversarial examples, model training techniques, and the biases models learn from data. As our empirical and theoretical understanding of neural models grows, we can interpret their test-time behavior in more principled and insightful ways.