Seeding everything to get the same masked words

Hi,

I am using a few different libraries (torch, hf transformers, lightning) so please feel free to let me know if this is outside the hf forum’s wheelhouse.

I have two datasets, for which I want to create two dataloaders, and I want to mask the same tokens in each.

I tried L.seed_everything() but was unable to get what I want.

Does anyone have any suggestions? My code is here: Google Colab

Your two data collators both draw from the same global random generator, so by the time the second one runs, the RNG state has advanced and they don't sample the same masks.

You can try copying the DataCollatorForLanguageModeling code and setting up a random generator with a fixed seed at initialization, so each collator gets its own deterministic stream of random draws.
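A minimal sketch of the idea, assuming you reproduce the core MLM masking step (select tokens with `torch.bernoulli`, replace with the mask token, set unmasked labels to -100) with an explicit `torch.Generator` instead of the global RNG that the stock collator uses. The function name `mask_tokens`, the 15% probability, and mask token id 103 are illustrative, not part of the library API:

```python
import torch

def mask_tokens(input_ids, mask_token_id, generator, mlm_probability=0.15):
    """Simplified version of the MLM masking step with an explicit,
    seedable torch.Generator (the stock collator uses the global RNG
    and also applies random-token / keep-original branches)."""
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix, generator=generator).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked tokens
    masked_inputs = input_ids.clone()
    masked_inputs[masked_indices] = mask_token_id
    return masked_inputs, labels

# Two generators seeded identically produce identical masks,
# so two collators built this way will mask the same positions.
g1 = torch.Generator().manual_seed(1234)
g2 = torch.Generator().manual_seed(1234)
batch = torch.randint(5, 100, (2, 16))
ids1, labels1 = mask_tokens(batch, mask_token_id=103, generator=g1)
ids2, labels2 = mask_tokens(batch, mask_token_id=103, generator=g2)
assert torch.equal(ids1, ids2) and torch.equal(labels1, labels2)
```

In practice you would subclass `DataCollatorForLanguageModeling` and route its random draws through a generator like this, creating each of your two collators with the same seed. Note this only lines up the masks if the batches fed to both collators line up as well.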
