Do we really preprocess the entire dataset with Hugging Face even when we train very large language models, e.g. GPT-3 size?

I see this code in the translation tutorial:

from transformers import AutoTokenizer
from datasets import load_dataset

# setup assumed from the surrounding tutorial: a T5 checkpoint, the opus_books
# en-fr dataset, and a task prefix prepended to every source sentence
tokenizer = AutoTokenizer.from_pretrained("t5-small")
books = load_dataset("opus_books", "en-fr")
source_lang, target_lang = "en", "fr"
prefix = "translate English to French: "

def preprocess_function(examples):
    # build the prefixed source sentences and the raw target sentences
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # tokenize the targets in target-tokenizer mode to get the label ids
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# todo - would be nice to remove this, since at GPT-2/GPT-3 scale you can't preprocess the entire dataset... or can you?
tokenized_books = books.map(preprocess_function, batched=True, batch_size=2)

Do we really do this in the wild? Or is this perhaps done on shards of the data? Can this be clarified?

Hi! map operates on chunks of data:

  • if batched=False (default): a single example is given to the map transform
  • if batched=True: a batch of size 1000 (can be controlled with the batch_size parameter) is given to the map transform

After processing, it buffers the processed examples/batches before writing them to disk to reduce the number of I/O calls (the default chunk size is 10000 and can be controlled with the writer_batch_size parameter to reduce RAM usage).

And thanks to this, it can process datasets larger than RAM.
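
For concreteness, on the snippet from your question the chunking knobs would look something like this (just a sketch; the values shown are the defaults described above, not recommendations):

tokenized_books = books.map(
    preprocess_function,
    batched=True,             # each call to preprocess_function receives a batch (dict of lists)
    batch_size=1000,          # number of examples handed to preprocess_function per call
    writer_batch_size=10000,  # number of processed examples buffered in RAM before each disk write
)

So only one batch plus the writer buffer has to fit in memory at any time, which is why a dataset larger than RAM is fine.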

When I use PyTorch, I usually only preprocess at the data loading stage – when I'm actually training, as far as I know people apply the transforms there. I've never seen people writing the preprocessed data to disk. That is what is very puzzling to me. It seems to be common? Why not just apply all the transforms at training time + use multiple workers to make this efficient?

Oh, I see. Yes, processing ahead of time is common in NLP, as we rarely have randomness in the transforms, so it's more efficient to apply them once to the data (and cache the result) for all epochs via map. On the other hand, random transforms are super common in CV, so it makes more sense to rerun them each epoch. For that, you can use set_transform(transform), which applies the transform function on access.
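
A minimal sketch of that on-access approach, reusing preprocess_function from your question (whether this fits your case depends on your transform, so treat it as illustrative):

# apply preprocess_function lazily: nothing is written to disk, and the
# transform runs on the accessed examples every time __getitem__ is called
books["train"].set_transform(preprocess_function)

sample = books["train"][0]  # preprocess_function runs here, on the fly
print(sample.keys())        # the fields produced by the transform, e.g. input_ids, labels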

Ah, I see.

In my case I am doing different "tasks" and prompting the model at train time, so fetching the data in the loader and then manipulating it at train time makes the most sense.

e.g. the encoder receives the natural language (NL) text "Do task X given data Y", and the decoder is then tasked with predicting the answer given X and Y; say it would receive "Z", right-shifted.
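
If it helps, here is a rough sketch of that setup with a plain PyTorch DataLoader, building the prompt and target on the fly in the collate function (raw_dataset and the "task"/"data"/"answer" field names are made up for illustration, and I'm assuming the same seq2seq tokenizer as above):

from torch.utils.data import DataLoader

def collate_fn(batch):
    # build the "Do task X given data Y" prompts and the answers Z at train time
    prompts = [f"Do task {ex['task']} given data {ex['data']}" for ex in batch]
    answers = [ex["answer"] for ex in batch]

    # tokenize on the fly; nothing is preprocessed or cached to disk beforehand
    model_inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(answers, padding=True, truncation=True, return_tensors="pt")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# multiple workers keep the on-the-fly tokenization off the training loop's critical path
loader = DataLoader(raw_dataset, batch_size=8, collate_fn=collate_fn, num_workers=4)

For encoder-decoder models in transformers, the right-shifted decoder inputs are typically derived from labels by the model itself, so passing labels is usually enough.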