For the Community Week project I’d like to extend the work of the McGill researchers who, by pre-training a BERT-style model on corpora with permuted word order (inducing effectively order-agnostic tokens), obtained high scores on a wide range of benchmark tasks (GLUE, PAWS, etc.). Their results suggest the model’s performance comes from learning distributional priors rather than from its ability to “discover the NLP pipeline”.
I’d like to replicate the authors’ work and perhaps extend it by evaluating the permuted models on other NLP and NLU benchmarks for which we have human-curated, gold-standard performance measures.
To start, the authors pre-trained models on permuted corpora that preserve sentence-level distributional information by randomly shuffling n-grams within each sentence, for n between 1 and 4. While they evaluated the permuted models against normally trained Transformer models in a wide range of settings (GLUE, etc.), I’d be interested to see how the models perform on human-curated benchmark datasets such as PAWS.
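To make the corpus-permutation step concrete, here is a minimal sketch of one plausible reading of the n-gram shuffling scheme: chunk each sentence into consecutive, non-overlapping n-grams and shuffle the order of those chunks, so unigram counts (and, for n > 1, some local word order) are preserved while sentence-level order is destroyed. The function name and signature are my own, not the authors’ code.

```python
import random

def permute_sentence(tokens, n, seed=None):
    """Shuffle a sentence by chunking it into consecutive n-grams
    and permuting the order of those chunks. With n=1 this is a full
    word shuffle; larger n preserves more local word order.

    This is an illustrative sketch, not the authors' implementation.
    """
    rng = random.Random(seed)
    # Split the token list into consecutive, non-overlapping n-grams
    # (the final chunk may be shorter than n).
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng.shuffle(chunks)
    # Flatten the shuffled chunks back into a single token list.
    return [tok for chunk in chunks for tok in chunk]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(permute_sentence(sentence, n=2, seed=0))
```

Note that the permutation preserves the sentence’s bag of words exactly, which is what lets the pre-trained model pick up distributional information despite the scrambled order.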