Caching a dataset processed with randomness

Is it possible to cache datasets that have been processed with a “random” function?

For example, the Process tutorial uses random.randint to pick words to mask. I believe this breaks the hashing that Datasets relies on to perform caching, so I thought to use a torch generator instead, like so:

        def sub_tok(batch, generator):
            input_ids: List[List[int]] = batch["input_ids"]
            for i, example in enumerate(input_ids):
                # Pick one token at random and mask every occurrence of it.
                token = example[
                    torch.randint(0, len(example), (1,), generator=generator).item()
                ]
                input_ids[i] = [MASK if tok == token else tok for tok in example]
            batch["input_ids"] = input_ids
            return batch

        g = torch.Generator(4444)
        lm_datasets["train"] = lm_datasets["train"].map(
            sub_tok,
            fn_kwargs={"generator": g},
            batched=True,
            num_proc=args.preprocessing_num_workers,
            load_from_cache_file=not args.overwrite_cache,
            desc="Enciphering XXX tokens per example",
        )

But I receive the error TypeError: cannot pickle 'torch._C.Generator' object. Removing the generator allows the dataset to be processed, but it breaks caching as expected. Any advice?

Hi! I opened a PR that implements a serializer for torch.Generator to avoid the pickle error.

PS: torch.Generator expects a device as the first argument in the constructor, not a seed.
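
Until the serializer is available, one workaround is to pass a plain integer seed through fn_kwargs and construct the generator inside the mapped function; integers pickle and hash fine, so the fingerprint Datasets computes for caching stays deterministic. A minimal sketch, reusing the MASK constant and tokenized lm_datasets from the snippet above (note it re-seeds per batch rather than drawing from one long-lived generator):

        import torch

        def sub_tok(batch, seed):
            # Build the generator here from a picklable int seed so the map
            # fingerprint (and therefore the cache) stays deterministic.
            g = torch.Generator()
            g.manual_seed(seed)
            input_ids = batch["input_ids"]
            for i, example in enumerate(input_ids):
                # Pick one token at random and mask every occurrence of it.
                token = example[
                    torch.randint(0, len(example), (1,), generator=g).item()
                ]
                input_ids[i] = [MASK if tok == token else tok for tok in example]
            batch["input_ids"] = input_ids
            return batch

        lm_datasets["train"] = lm_datasets["train"].map(
            sub_tok,
            fn_kwargs={"seed": 4444},
            batched=True,
            desc="Masking one token per example",
        )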
