Pipeline with custom dataset tokenizer: when to save/load manually

I am trying my hand at the datasets library and I am not sure that I understand the flow.

Let’s assume that I have a single file that is a pickled dict. In that dict, I have two keys that each contain a list of datapoints. One of them is text and the other one is a sentence embedding (yeah, working on a strange project…).

I know that I can create a dataset from this file as follows:

import torch
from datasets import Dataset
from transformers import AutoTokenizer

dataset = Dataset.from_dict(torch.load("data.pt"))
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
keys_to_retain = {"input_ids", "sembedding"}
dataset = dataset.map(lambda example: tokenizer(example["text"], padding='max_length'), batched=True)
dataset.remove_columns_(set(dataset.column_names) - keys_to_retain)
dataset.set_format(type="torch", columns=["input_ids", "sembedding"])

My question is, what’s next? Especially considering how caching works. The first thing that should happen is splitting the dataset into train, dev, and test sets. As a result I would eventually have a dictionary with train, dev, and test keys in it. I can then use those in dataloaders and I am ready to go.

The question is, what about subsequent runs (i.e. new Python sessions)? Will all that code need to be run again? Should I do dataset.save_to_disk, and in a next session not run the whole dataset creation again? In other words, do I have to manually check for the saved files? Something like this (untested).

def create_datasets(dataset_path):
    if Path(dataset_path).exists():
        # Reuse the partitions that an earlier run saved to disk
        datasets = {partition: load_from_disk(Path(dataset_path) / partition) for partition in ["train", "dev", "test"]}
    else:
        # the snippet that I posted above
        # assuming we have train, dev, test in datasets
        for key, dataset in datasets.items():
            dataset.save_to_disk(Path(dataset_path) / key)
    return datasets

Or is the dataset cached somewhere and every time the first snippet is encountered, none of those steps is repeated and the cached dataset is loaded?

In short, it is not clear to me when I can rely on cache that is hidden (probably somewhere in the user directory), and when I should manually use save_to_disk and load a dataset manually.



The caching should work across sessions, so normally you don’t have to use save_to_disk. The cache is indexed by a hash of the operations performed on the dataset. If a new, independent session performs the same operations, it will use the cache instead of recomputing them. If you change anything in the operations performed on the dataset, they will be recomputed instead of being loaded from the cache.

I will add details on the hashing mechanism to the docs when I have some time (no ETA). Basically, the hash used to store the dataset is computed from a complete pickle dump of all the arguments you provide to the processing function at each step (including the function provided to map). So if anything changes, it will be detected and the operation is recomputed instead of using the cache. If all the arguments and inputs are identical, the hash is the same (whether it’s the same session or not) and the cache file is used if it is found.
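To make the idea concrete, here is a toy sketch of a cache keyed by a hash of the previous state, the function, and its arguments. It is only an illustration under simplifying assumptions, not the library's actual algorithm: the names fingerprint and cached_map are made up for this example, the function is hashed through its bytecode rather than a full recursive pickle, and the cache is an in-memory dict rather than files on disk.

```python
import hashlib
import pickle


def fingerprint(previous: str, func, kwargs: dict) -> str:
    """Toy fingerprint of one processing step (hypothetical helper, not the library's algorithm)."""
    code = func.__code__
    digest = hashlib.sha256()
    digest.update(previous.encode("utf-8"))
    # Bytecode, constants and referenced names change when the function body changes.
    digest.update(repr((code.co_code, code.co_consts, code.co_names)).encode("utf-8"))
    digest.update(pickle.dumps(sorted(kwargs.items())))
    return digest.hexdigest()


_cache = {}


def cached_map(rows, state, func, **kwargs):
    """Apply func to every row unless an identical step was already computed."""
    key = fingerprint(state, func, kwargs)
    if key not in _cache:
        _cache[key] = [func(row, **kwargs) for row in rows]
    return _cache[key], key


def upper(text):
    return text.upper()


def lower(text):
    return text.lower()


rows = ["Hello", "World"]
first, key_a = cached_map(rows, "raw-data", upper)
second, key_b = cached_map(rows, "raw-data", upper)  # same step: served from the cache
third, key_c = cached_map(rows, "raw-data", lower)   # different function: recomputed
```

The second call returns the stored result because its key matches the first one, while swapping the function yields a new key and a fresh computation. Swapping the tokenizer passed to map has the same effect on the real cache.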

save_to_disk is provided as a special utility mostly for people who preprocess a dataset on one machine which has access to the internet and would like to use the dataset on a cluster without any access to the internet (and which thus cannot download the dataset files).


The idea is that you can write simple, readable code once and not worry about it redoing the downloading/pre-processing operations when you run it several times, because all of these are automatically cached.

So to summarize, I can just use the following snippet in my script and it will automatically know to skip those dataset processing lines? I assume that it will load the cached dataset at Dataset.from_dict and then skip all operations on the dataset by checking whether their hashes are already in the cache. I do not know enough about hashing to know exactly how this works. If I tokenize with another tokenizer, will it ignore the cached version and do the tokenization from scratch with the new tokenizer?

That being said, it does mean that the tokenizer is created but not used (at this stage), as is keys_to_retain. That is not a problem; I am just trying to understand.

def create_dataset(data_path):
    dataset = Dataset.from_dict(torch.load(data_path))
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    keys_to_retain = {"input_ids", "sembedding"}
    dataset = dataset.map(lambda example: tokenizer(example["text"], padding='max_length'), batched=True)
    dataset.remove_columns_(set(dataset.column_names) - keys_to_retain)
    dataset.set_format(type="torch", columns=["input_ids", "sembedding"])
    return dataset

ds = create_dataset(data_p)
dataloader = DataLoader(ds, ...)

Yes exactly.

If you tokenize with another tokenizer at some point, it indeed won’t use the cache and will create a new cache file (the function given to map is pickled recursively to extract all of its dependencies).

Also, you don’t need to do remove_columns_, setting the format with the list of columns is enough (see https://huggingface.co/docs/datasets/torch_tensorflow.html)

About your last comment: I can see that after set_format, the other columns are still present. It seems from the documentation that the difference is that when __getitem__ is called (e.g. in the dataloader) only those columns are returned. But doesn’t that mean that the size of the dataset on disk is a lot bigger than I need?

Let’s say that I do not need attention masks and token_type_ids. If I use set_format, that information will still be present in the dataset somewhere, but it just won’t be returned. That would mean that the on-disk dataset also contains those columns, even if they might not be necessary. Isn’t it better to remove those completely, then?

I also found that the caching does not seem to work as I expected. The “mapping” happens at each new run of the script. For now I resorted to my earlier suggestion of saving the dataset after all operations, and checking beforehand whether the processed directory exists. If so, just load it. If not, process and save.

Well for most of these operations we go with the fastest option, i.e. not updating the underlying storage but only the in-memory representation.

If you need to be very careful about disk space, you can map the dataset again to update the on-disk representation, or you can use the load_dataset(keep_in_memory=True) option to always keep the dataset in RAM.

OK, this should not happen. If you can share a snippet to reproduce it, we can investigate why caching is not working.

Which tokenizer are you using? I made sure that the caching works for many tokenizers, but maybe there’s one I haven’t tested.

This occurred for me with wietsedv/bert-base-dutch-cased. Can you let me know whether you can reproduce the issue?

Do you think you could share a snippet @BramVanroy? That would help @lhoestq a lot because it might be quite specific to the exact processing you decided to do.

@thomwolf @lhoestq If you run the function below with any given text file (one sentence per line), it will run the mapped function every time you run the code, even for the same data and identical arguments.

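from pathlib import Path
from typing import Dict, List, Union

from datasets import Dataset, DatasetDict
from torch import Tensor
from torch.utils.data import DataLoader
from transformers import PreTrainedTokenizer

def prepare_data(dataset_f: str,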
                tokenizer: PreTrainedTokenizer,
                max_seq_length: int = None,
                batch_size: int = 64,
                num_workers: int = 0) -> Dict[str, DataLoader]:
   """Given an input file, prepare the train, test, validation dataloaders.
   :param dataset_f: input file
   :param tokenizer: pretrained tokenizer that will prepare the data, i.e. convert tokens into IDs
   :param max_seq_length: maximal sequence length. Longer sequences will be truncated
   :param batch_size: batch size for the dataloaders
   :param num_workers: number of CPU workers to use during dataloading. On Windows this must be zero
   :return: a dictionary containing train, test, validation dataloaders
   """
   max_seq_length = tokenizer.model_max_length if not max_seq_length else max_seq_length

   def preprocess(sentences: List[str]) -> Dict[str, Union[list, Tensor]]:
       """Preprocess the raw input sentences from the text file.
       :param sentences: a list of sentences (strings)
       :return: a dictionary of "input_ids"
       """
       tokens = [s.strip().split() for s in sentences]
       tokens = [t[:max_seq_length - 1] + [tokenizer.eos_token] for t in tokens]

       # The sequences are not padded here. we leave that to the dataloader in a collate_fn
       # (not included in this snippet for illustrative purposes)
       # That means: a bit slower processing, but a smaller saved dataset size
       encoded_d = tokenizer(tokens,

       return {"input_ids": encoded_d["input_ids"]}

   dataset = Dataset.from_dict({"text": Path(dataset_f).read_text(encoding="utf-8").splitlines()})

   # 80% (train), 20% (test + validation)
   train_testvalid = dataset.train_test_split(test_size=0.2)
   # 10% of total (test), 10% of total (validation)
   test_valid = train_testvalid["test"].train_test_split(test_size=0.5)

   dataset = DatasetDict({"train": train_testvalid["train"],
                          "test": test_valid["test"],
                          "valid": test_valid["train"]})

   dataset = dataset.map(preprocess, input_columns=["text"], batched=True)
   dataset.set_format("torch", columns=["input_ids"])

   return {partition: DataLoader(ds,
                                 batch_size=batch_size,
                                 num_workers=num_workers,
                                 pin_memory=True) for partition, ds in dataset.items()}

As a tokenizer I used the Dutch one (below), but the same issue exists with "bert-base-cased" so I don’t think the tokenizer is the issue.

tokenizer = AutoTokenizer.from_pretrained("wietsedv/bert-base-dutch-cased")
tokenizer.add_special_tokens({"eos_token": "[EOS]"})
prepare_data("path/to/data.txt", tokenizer)

Awesome! And may I ask which versions of the datasets and transformers library you are using?

  • transformers 3.1.0
  • datasets 1.0.1

The problem also occurs on both Windows and Ubuntu (only tested Python 3.7 for now).

Hi!

In this code you call train_test_split without setting a seed, which means that every time you run this code, it will shuffle the dataset in a different way. Since the created splits are always different, it can’t reload them from cache. You can use train_test_split(test_size=0.2, seed=...) to set the seed.
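A small standard-library sketch of why an unseeded shuffle defeats a hash-based cache. The helper split_fingerprint is hypothetical and only mimics the idea of shuffling, splitting, and hashing the result; it is not how train_test_split is implemented.

```python
import hashlib
import random


def split_fingerprint(items, test_size=0.2, seed=None):
    """Shuffle, split off the train part, and hash it (hypothetical helper)."""
    rng = random.Random(seed)  # seed=None draws fresh entropy on every call
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    train = shuffled[n_test:]
    return hashlib.sha256("\n".join(train).encode("utf-8")).hexdigest()


sentences = [f"sentence {i}" for i in range(100)]

seeded_a = split_fingerprint(sentences, seed=13)
seeded_b = split_fingerprint(sentences, seed=13)  # same seed: same split, same hash
unseeded_a = split_fingerprint(sentences)
unseeded_b = split_fingerprint(sentences)         # new shuffle: almost surely a new hash
```

With a fixed seed the train split, and anything keyed on it, is identical between runs. Without one, every run produces a split that the cache has never seen.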

Furthermore, you are loading a dataset that is in memory. By default, caching is enabled only for datasets on disk. What you can do is load an on-disk dataset using load_dataset("text", data_files=dataset_f). Another option is to write the dataset to disk at some point using dataset.map(..., cache_file_name=<path/to/cache/file>).


Thanks for the reply. The seed should not be an issue I think because I manually set the seed at the start of the whole script (for random, torch, and numpy). Just to be sure, I’ll set the seed in that function, too. However, I moved away from using load_dataset because of this issue, so I can’t test whether the cache_file_name works.

@lhoestq, do you think from_dict and from_pandas should create a cache file on disk by default?
With a keep_in_memory option?


Perhaps keep_in_memory is a bit confusing? For me it sounds as if that means “keep in RAM”. Maybe save_to_cache?

Yes, that’s exactly what it means :wink: keep_in_memory means keep in RAM.

Ah, okay I thought you meant it the other way around. Sounds good!