Hi,
Might you double-check the documentation here: Transformers-Tasks-Language Modeling? Specifically, the TensorFlow section that deals with the DataCollator reads:
"You can use the end of sequence token as the padding token, and set mlm=False
. This will use the inputs as labels shifted to the right by one element:
The code reads:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
The code, however, never sets the padding token.
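For comparison, here is what I'd expect the snippet to look like. This is only my sketch of the presumed intent (the distilgpt2 checkpoint is my own illustrative choice, not something the docs state):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative checkpoint (my assumption); GPT-style tokenizers have no pad token by default
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# The quoted prose says to reuse the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# mlm=False makes the collator build causal-LM labels (inputs shifted right by one)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")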
And then:
"For masked language modeling, use the same DataCollatorForLanguageModeling except you should specify mlm_probability
to randomly mask tokens each time you iterate over the data.
And the code reads:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
The code, however, still has mlm=False and never specifies mlm_probability.
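Presumably the intended snippet is something like this sketch (the checkpoint is my illustrative choice, and mlm_probability=0.15 is an illustrative value that I believe is also the library default):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative checkpoint (my assumption); the tokenizer needs a mask token for MLM
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

# mlm=True masks random tokens on each iteration; mlm_probability sets the masking rate
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, return_tensors="tf"
)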
Also, the associated notebook uses the default data collator in its Causal Language Modeling section. But elsewhere in the documentation (Course: Training a Causal Language Model from Scratch), it reads: "By default [DataCollatorForLanguageModeling] prepares data for MLM…"
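That quote matches the constructor's default as far as I can tell: mlm defaults to True, so the causal LM sections would need an explicit mlm=False. A quick check (again my sketch; the checkpoint is illustrative):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# A tokenizer with a mask token, since the default mlm=True requires one
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)
print(data_collator.mlm)  # True: by default the collator prepares masked-LM batches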
Thanks!