Seeding everything to get the same masked words

Hi,

I am using a few different libraries (torch, hf transformers, lightning) so please feel free to let me know if this is outside the hf forum’s wheelhouse.

I have two datasets, for which I want to create two dataloaders, and I want to mask the same tokens in each.

I tried L.seed_everything() but was unable to get what I want.

Does anyone have any suggestions? My code is here: Google Colab

Your two data collators both draw from the same global random generator, so by the time the second one runs, the RNG state has advanced and they don't sample the same masks.

You can try copying the DataCollatorForLanguageModeling code and setting up a random generator with a fixed seed at initialization, so each collator gets its own deterministic stream of random draws.
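A minimal sketch of the idea, assuming you reproduce the core MLM masking step (select tokens with `torch.bernoulli`, replace with the mask token, set unmasked labels to -100) with an explicit `torch.Generator` instead of the global RNG that the stock collator uses. The function name `mask_tokens`, the 15% probability, and mask token id 103 are illustrative, not part of the library API:

```python
import torch

def mask_tokens(input_ids, mask_token_id, generator, mlm_probability=0.15):
    """Simplified version of the MLM masking step with an explicit,
    seedable torch.Generator (the stock collator uses the global RNG
    and also applies random-token / keep-original branches)."""
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix, generator=generator).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked tokens
    masked_inputs = input_ids.clone()
    masked_inputs[masked_indices] = mask_token_id
    return masked_inputs, labels

# Two generators seeded identically produce identical masks,
# so two collators built this way will mask the same positions.
g1 = torch.Generator().manual_seed(1234)
g2 = torch.Generator().manual_seed(1234)
batch = torch.randint(5, 100, (2, 16))
ids1, labels1 = mask_tokens(batch, mask_token_id=103, generator=g1)
ids2, labels2 = mask_tokens(batch, mask_token_id=103, generator=g2)
assert torch.equal(ids1, ids2) and torch.equal(labels1, labels2)
```

In practice you would subclass `DataCollatorForLanguageModeling` and route its random draws through a generator like this, creating each of your two collators with the same seed. Note this only lines up the masks if the batches fed to both collators line up as well.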
