SHAP Value [MASK] vs attention mask

The official SHAP Python library uses the [MASK] token to mask out tokens, so it can measure their influence on the model's predictions.
My intuition was that adjusting the attention mask would be a cleaner way of including and excluding individual tokens.

ChatGPT argued that adjusting the attention mask is the cleaner implementation in theory, but that models like BERT were not trained with these kinds of attention masks, and that doing so would be “altering the structure of the transformer. BERT’s architecture assumes full token visibility across the sequence (even for masked tokens), and masking via the attention mechanism fundamentally changes how BERT processes information”.
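
To make the two options concrete, here is a minimal sketch of what I mean (this is not the SHAP library's actual masker code; the bert-base-uncased checkpoint, the example sentence and the dropped position are just placeholders for illustration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

text = "the movie was surprisingly good"
enc = tokenizer(text, return_tensors="pt")
drop_position = 4  # hypothetical token position to "remove"

# Variant 1: replace the token with [MASK] (the SHAP-style approach).
masked_ids = enc["input_ids"].clone()
masked_ids[0, drop_position] = tokenizer.mask_token_id
with torch.no_grad():
    logits_mask_token = model(input_ids=masked_ids,
                              attention_mask=enc["attention_mask"]).logits

# Variant 2: keep the token but zero its attention-mask entry,
# so no other position can attend to it ("a hole in the attention mask").
holey_attention = enc["attention_mask"].clone()
holey_attention[0, drop_position] = 0
with torch.no_grad():
    logits_attn_hole = model(input_ids=enc["input_ids"],
                             attention_mask=holey_attention).logits

# The two variants generally produce different logits.
print(logits_mask_token, logits_attn_hole)
```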

Does anyone have experience with putting holes in the attention mask like this?
Does the attention mask behave differently in other models, e.g. not fully excluding the position?

Greetings,
Ahmad