Do we really preprocess the entire dataset with Hugging Face even when we train very large language models, e.g. GPT-3 size?

I see this code in the translation tutorial:

from transformers import AutoTokenizer
from datasets import load_dataset

# setup assumed from the surrounding tutorial: a T5 checkpoint, the opus_books
# en-fr dataset, and a task prefix prepended to every source sentence
tokenizer = AutoTokenizer.from_pretrained("t5-small")
books = load_dataset("opus_books", "en-fr")
source_lang, target_lang = "en", "fr"
prefix = "translate English to French: "

def preprocess_function(examples):
    # build the prefixed source sentences and the raw target sentences
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # tokenize the targets in target-tokenizer mode to get the label ids
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# todo - would be nice to remove this, since at GPT-2/GPT-3 scale you can't preprocess the entire dataset... or can you?
tokenized_books = books.map(preprocess_function, batched=True, batch_size=2)

Do we really do this in the wild? Or is this perhaps done on shards of the data? Can this be clarified?

Hi! map operates on chunks of data:

  • if batched=False (default): a single example is given to the map transform
  • if batched=True: a batch of size 1000 (can be controlled with the batch_size parameter) is given to the map transform

After processing, it buffers the processed examples/batches before writing them to disk to reduce the number of I/O calls (the default chunk size is 10000 and can be controlled with the writer_batch_size parameter to reduce RAM usage).

And thanks to this, it can process datasets larger than RAM.
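
For concreteness, on the snippet from your question the chunking knobs would look something like this (just a sketch; the values shown are the defaults described above, not recommendations):

tokenized_books = books.map(
    preprocess_function,
    batched=True,             # each call to preprocess_function receives a batch (dict of lists)
    batch_size=1000,          # number of examples handed to preprocess_function per call
    writer_batch_size=10000,  # number of processed examples buffered in RAM before each disk write
)

So only one batch plus the writer buffer has to fit in memory at any time, which is why a dataset larger than RAM is fine.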

When I use PyTorch, I usually only preprocess at the data loading stage – when I'm actually training, as far as I know people apply the transforms there. I've never seen people writing the preprocessed data to disk. That is what is very puzzling to me. It seems to be common? Why not just apply all the transforms at training time + use multiple workers to make this efficient?

Oh, I see. Yes, processing ahead of time is common in NLP, as we rarely have randomness in the transforms, so it's more efficient to apply them once to the data (and cache the result) for all epochs via map. On the other hand, random transforms are super common in CV, so it makes more sense to rerun them each epoch. For that, you can use set_transform(transform), which applies the transform function on access.
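
A minimal sketch of that on-access approach, reusing preprocess_function from your question (whether this fits your case depends on your transform, so treat it as illustrative):

# apply preprocess_function lazily: nothing is written to disk, and the
# transform runs on the accessed examples every time __getitem__ is called
books["train"].set_transform(preprocess_function)

sample = books["train"][0]  # preprocess_function runs here, on the fly
print(sample.keys())        # the fields produced by the transform, e.g. input_ids, labels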

Ah, I see.

In my case I am doing different "tasks" and prompting the model at train time, so fetching the data in the loader and then manipulating it at train time makes the most sense.

e.g. the encoder receives the natural language (NL) text "Do task X given data Y", and the decoder is then tasked with predicting the answer given X and Y; say it would receive "Z", right-shifted.
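
If it helps, here is a rough sketch of that setup with a plain PyTorch DataLoader, building the prompt and target on the fly in the collate function (raw_dataset and the "task"/"data"/"answer" field names are made up for illustration, and I'm assuming the same seq2seq tokenizer as above):

from torch.utils.data import DataLoader

def collate_fn(batch):
    # build the "Do task X given data Y" prompts and the answers Z at train time
    prompts = [f"Do task {ex['task']} given data {ex['data']}" for ex in batch]
    answers = [ex["answer"] for ex in batch]

    # tokenize on the fly; nothing is preprocessed or cached to disk beforehand
    model_inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(answers, padding=True, truncation=True, return_tensors="pt")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# multiple workers keep the on-the-fly tokenization off the training loop's critical path
loader = DataLoader(raw_dataset, batch_size=8, collate_fn=collate_fn, num_workers=4)

For encoder-decoder models in transformers, the right-shifted decoder inputs are typically derived from labels by the model itself, so passing labels is usually enough.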