Hi,
Might you double-check the documentation here: Transformers-Tasks-Language Modeling? Specifically, the TensorFlow section that deals with the DataCollator reads:
"You can use the end of sequence token as the padding token, and set mlm=False
. This will use the inputs as labels shifted to the right by one element:
The code reads:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
The code, however, never sets the padding token.
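For comparison, here is what I'd expect the snippet to look like. This is only my sketch of the presumed intent (the distilgpt2 checkpoint is my own illustrative choice, not something the docs state):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative checkpoint (my assumption); GPT-style tokenizers have no pad token by default
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# The quoted prose says to reuse the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# mlm=False makes the collator build causal-LM labels (inputs shifted right by one)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")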
And then:
"For masked language modeling, use the same DataCollatorForLanguageModeling except you should specify mlm_probability
to randomly mask tokens each time you iterate over the data.
And the code reads:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
The code, however, still has mlm=False and never specifies mlm_probability.
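Presumably the intended snippet is something like this sketch (the checkpoint is my illustrative choice, and mlm_probability=0.15 is an illustrative value that I believe is also the library default):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative checkpoint (my assumption); the tokenizer needs a mask token for MLM
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

# mlm=True masks random tokens on each iteration; mlm_probability sets the masking rate
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, return_tensors="tf"
)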
Also, the associated notebook uses the default data collator in its Causal Language Modeling section. But elsewhere in the documentation (Course: Training a Causal Language Model from Scratch), it reads: "By default [DataCollatorForLanguageModeling] prepares data for MLM…"
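That quote matches the constructor's default as far as I can tell: mlm defaults to True, so the causal LM sections would need an explicit mlm=False. A quick check (again my sketch; the checkpoint is illustrative):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# A tokenizer with a mask token, since the default mlm=True requires one
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)
print(data_collator.mlm)  # True: by default the collator prepares masked-LM batches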
Thanks!