There is a paper that I have been trying to reproduce (https://arxiv.org/pdf/2205.11482.pdf) as part of my master’s thesis. It uses T5 to learn facts from a training set in which either the object or the subject of each fact is masked with a sentinel token. An example of a training sample (the paper calls them abstracts):
Input: “Animal Farm is an allegorical and dystopian novella by <extra_id_0>, first published in England on 17 August 1945.”
Target: “<extra_id_0> George Orwell”
The entire dataset can be found on the Hugging Face Hub: ekinakyurek/ftrace
The thing I’m wondering about is that in the T5 docs, the use of sentinel tokens is specified as:
Input: “The <extra_id_0> walks in <extra_id_1> park”
Target: “<extra_id_0> cute dog <extra_id_1> the <extra_id_2>”
i.e. the input and target masks are a sort of inverse of each other, with a final sentinel token closing the target.
You will notice that this is not the case in the example from the dataset I’m working with. If I’m right, the target should be “<extra_id_0> George Orwell <extra_id_1>”, since the input mask sits in the middle of the abstract.
It is far from the only case, as you will see if you explore the dataset.
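To make the docs convention concrete, here is a minimal word-level sketch of the span-corruption format (just an illustration: real T5 masking operates on SentencePiece subword tokens, and the `span_corrupt` helper below is my own, not from the paper or the library):

```python
def span_corrupt(tokens, spans):
    """Build a T5-style span-corruption (input, target) pair.

    tokens: list of word strings
    spans:  sorted, non-overlapping (start, end) index pairs to mask
            (end exclusive)

    Follows the convention in the T5 docs: each masked span is replaced
    in the input by <extra_id_i>, the target lists each sentinel followed
    by the span it replaced, and one extra sentinel closes the target.
    """
    input_parts, target_parts = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        input_parts.extend(tokens[cursor:start])  # unmasked prefix
        input_parts.append(sentinel)              # mask the span
        target_parts.append(sentinel)
        target_parts.extend(tokens[start:end])    # span goes to the target
        cursor = end
    input_parts.extend(tokens[cursor:])           # unmasked suffix
    target_parts.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(input_parts), " ".join(target_parts)

# The docs example:
inp, tgt = span_corrupt("The cute dog walks in the park".split(),
                        [(1, 3), (5, 6)])
# inp == "The <extra_id_0> walks in <extra_id_1> park"
# tgt == "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"

# The abstract from the dataset, masking "George Orwell," at word level:
abstract = ("Animal Farm is an allegorical and dystopian novella by "
            "George Orwell, first published in England on 17 August 1945.").split()
inp2, tgt2 = span_corrupt(abstract, [(9, 11)])
# tgt2 == "<extra_id_0> George Orwell, <extra_id_1>"
```

Under this convention the target for the dataset example would indeed carry a trailing `<extra_id_1>`, which is exactly what the abstracts in ftrace omit.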
This has left me wondering how this “not-so-perfect” placement and formatting of sentinel tokens might affect the training of T5. Should it be considered a serious data-quality issue, or do its implications mostly wash out when training on a lot of data?
Thanks for reading through my question! I hope someone will be able to clarify my doubts :)