SHAP Value [MASK] vs attention mask

The official SHAP Python library uses the [MASK] token to mask out tokens, so it can measure their influence on the model's predictions.
My intuition was that adjusting the attention mask would be a cleaner way of including and excluding individual tokens.

ChatGPT argued that adjusting the attention mask is the cleaner implementation in theory, but that models like BERT were not trained with these kinds of attention masks, and that doing so would be “altering the structure of the transformer. BERT’s architecture assumes full token visibility across the sequence (even for masked tokens), and masking via the attention mechanism fundamentally changes how BERT processes information”.
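
To make the two options concrete, here is a minimal sketch of what I mean (this is not the SHAP library's actual masker code; the bert-base-uncased checkpoint, the example sentence and the dropped position are just placeholders for illustration):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

text = "the movie was surprisingly good"
enc = tokenizer(text, return_tensors="pt")
drop_position = 4  # hypothetical token position to "remove"

# Variant 1: replace the token with [MASK] (the SHAP-style approach).
masked_ids = enc["input_ids"].clone()
masked_ids[0, drop_position] = tokenizer.mask_token_id
with torch.no_grad():
    logits_mask_token = model(input_ids=masked_ids,
                              attention_mask=enc["attention_mask"]).logits

# Variant 2: keep the token but zero its attention-mask entry,
# so no other position can attend to it ("a hole in the attention mask").
holey_attention = enc["attention_mask"].clone()
holey_attention[0, drop_position] = 0
with torch.no_grad():
    logits_attn_hole = model(input_ids=enc["input_ids"],
                             attention_mask=holey_attention).logits

# The two variants generally produce different logits.
print(logits_mask_token, logits_attn_hole)
```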

Does anyone have experience with putting holes in the attention mask like this?
Does the attention mask behave differently in other models, e.g. not fully excluding the position?

Greetings,
Ahmad