Hi all, I am wondering if anyone can offer some insight or logical reasons why a BERT model fine-tuned on SQuAD for Q&A might output different answers to the same question when the only difference in the context is the absence or presence of a concluding full stop (I am also interested in how other punctuation affects performance, for that matter). The behaviour differs between models: the attached capture uses bert-large-uncased-whole-word-masking-squad2, whereas I get consistent answers from roberta-base-squad2. That in itself is obviously not shocking, but I wanted to note that robustness may differ between models, and I would like to understand why for this particular observation.
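For reference, here is a minimal sketch of how I am comparing the two. The context/question pair is a made-up placeholder (substitute one that triggers the effect for you), and I am assuming the deepset checkpoints on the Hugging Face Hub under those names:

```python
from transformers import pipeline

# Placeholder example -- swap in your own context/question pair.
context = "The Eiffel Tower was completed in 1889 and is located in Paris"
question = "When was the Eiffel Tower completed?"

for model_name in [
    "deepset/bert-large-uncased-whole-word-masking-squad2",  # assumed Hub id
    "deepset/roberta-base-squad2",                           # assumed Hub id
]:
    qa = pipeline("question-answering", model=model_name)
    # Ask the same question against the context with and without a final full stop.
    for ctx in (context, context + "."):
        out = qa(question=question, context=ctx)
        print(
            model_name,
            "| trailing '.':", ctx.endswith("."),
            "| answer:", out["answer"],
            "| score:", round(out["score"], 4),
        )
```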
Additionally, if anyone can recommend any papers on this, that would be great!
TYIA!!!
(screenshot: Capture.PNG)