Q & A Model Robustness for concluding periods

pythagorasthe10th · January 23, 2021, 2:27am

Hi all, I am wondering if anyone can offer some insight into why a BERT model trained on SQuAD for Q&A tasks might output different answers to the same question when the only difference is the presence or absence of a concluding full stop in the context (I’m also interested in the effect of other punctuation on performance). The behaviour differs between models: the capture below uses bert-large-uncased-whole-word-masking-squad2, whereas I get consistent answers from roberta-base-squad2. That’s obviously not shocking, but I wanted to point out that robustness to this kind of change might differ between models, and I’d like to understand why that is for this particular observation.

Additionally if anyone can recommend any papers on this that would be great!

TYIA!!!
[Image: Capture.PNG]

lewtun · January 23, 2021, 8:40am

Hi @pythagorasthe10th, one possible explanation is that BERT’s attention heads are known to pay special attention to commas and full stops in the last few layers:

This figure comes from “What Does BERT Look at? An Analysis of BERT’s Attention” by Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning, where the authors explain the phenomenon in terms of the high frequency of periods and commas in the corpus:

Interestingly, we found that a substantial amount of BERT’s attention focuses on a few tokens (see Figure 2). For example, over half of BERT’s attention in layers 6-10 focuses on [SEP]. To put this in context, since most of our segments are 128 tokens long, the average attention for a token occurring twice in a segments like [SEP] would normally be 1/64. [SEP] and [CLS] are guaranteed to be present and are never masked out, while periods and commas are the most common tokens in the data excluding “the,” which might be why the model treats these tokens differently. A similar pattern occurs for the uncased BERT model, suggesting there is a systematic reason for the attention to special tokens rather than it being an artifact of stochastic training.
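To make the 1/64 baseline in the quote concrete: if attention were spread uniformly over a 128-token segment, a token that occurs twice would receive 2/128 of it. Here is a quick sketch of that arithmetic. The function name is only for illustration, and the 1/2 figure is a lower bound taken from the “over half” wording in the quote, not an exact measurement.

```python
from fractions import Fraction


def uniform_attention_share(occurrences: int, segment_length: int) -> Fraction:
    """Share of attention a token would get if attention were uniform over the segment."""
    return Fraction(occurrences, segment_length)


# [SEP] occurs twice in a 128-token segment.
baseline = uniform_attention_share(occurrences=2, segment_length=128)
print(baseline)  # 1/64

# "Over half" of the attention going to [SEP] is at least this many times the baseline.
observed_lower_bound = Fraction(1, 2)
print(observed_lower_bound / baseline)  # 32
```

So in those layers [SEP] gets at least 32 times the attention it would under a uniform spread.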

I’m not sure if this conclusion carries through to question-answering / fine-tuning, but naively I would guess so. Perhaps you don’t see this in RoBERTa since the next-sentence prediction task is dropped, but I’m not sure about this either.
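If you want to check this more systematically than with a single screenshot, something like the sketch below might help. It runs the same question against a context with and without a trailing full stop and compares the answers. The helper names (`compare_trailing_period`, `answers_match`, `load_qa`) are made up for this example. The only library call is the standard 🤗 Transformers `pipeline("question-answering", ...)`, which returns a dict with an `answer` key.

```python
from typing import Callable, Dict


def compare_trailing_period(qa: Callable[..., Dict], question: str, context: str) -> Dict[str, Dict]:
    """Run `qa` on the same context with and without a concluding full stop."""
    # Normalise first so both variants differ only in the final period.
    base = context.rstrip().rstrip(".")
    return {
        "without_period": qa(question=question, context=base),
        "with_period": qa(question=question, context=base + "."),
    }


def answers_match(results: Dict[str, Dict]) -> bool:
    """True if both variants produced the same answer string."""
    return results["without_period"]["answer"] == results["with_period"]["answer"]


def load_qa(model_name: str):
    """Build a question-answering pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers above can be used without transformers installed.
    from transformers import pipeline

    return pipeline("question-answering", model=model_name)
```

For example, `compare_trailing_period(load_qa(model_name), question, context)` with your own question and context, once for each of the two models you mentioned. I am assuming the checkpoints live under the `deepset/` namespace on the Hub (e.g. `deepset/roberta-base-squad2`), so adjust the names if yours differ. Comparing the `score` fields as well as the answers should show whether the full stop flips the prediction or only shifts the confidence.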

You might also find the BERTology papers of interest: [2002.12327] A Primer in BERTology: What we know about how BERT works

HTH!

pythagorasthe10th · January 25, 2021, 2:39am

Thanks a bunch @lewtun! This is very helpful!
