Using Dataset.set_transform() along with the Trainer class

Hello,

I’m trying to use the set_transform(…) method of a dataset returned by load_dataset, together with DataCollatorForLanguageModeling and the Trainer class from the transformers library, to pretrain a model. My dataset is large: tokenizing it all up front does not fit in RAM, and pre-tokenizing with .map() uses far too much disk space (> 500 GB), which is limited in my case. So I need to tokenize on the fly.
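(For context, the eager approach I’m trying to avoid looks roughly like the sketch below; it reuses the encode function and txt_train_dataset placeholder from my code further down and is only illustrative.)

from datasets import load_dataset

# Eager tokenization with .map(): every tokenized example is written to the
# Arrow cache on disk, which is what exceeds my disk budget for a large corpus.
raw_dataset = load_dataset('text', data_files={'train': txt_train_dataset})
tokenized_dataset = raw_dataset.map(encode, batched=True)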

While set_transform works as expected when I index the dataset directly, it fails when I plug it into a DataCollatorForLanguageModeling and a Trainer, and I don’t know why.

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

tokenizer = ...

def encode(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")

train_dataset = load_dataset('text', data_files={'train': txt_train_dataset})
train_dataset.set_transform(encode)

validation_dataset = load_dataset('text', data_files={'validation': txt_validation_dataset})
validation_dataset.set_transform(encode)

print(train_dataset["train"][:3])  # works as expected: {'input_ids': ..., 'attention_mask': ...}

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=...,
    args=...,
    data_collator=data_collator,
    train_dataset=train_dataset["train"],
    eval_dataset=validation_dataset["validation"],
    compute_metrics=...,
)

I’m probably missing something.


hey @tkon3, when you say “it fails” what do you mean exactly? there’s a related thread on set_transform here that might be useful: Understanding set_transform

There is a type change, and the Trainer class fails to loop over it.

File "/home/transformers_datasets_test/model/trainer.py", line 88, in encode
        return tokenizer(batch["text"], truncation=True, max_length=tokenizer.model_max_length)
    KeyError: 'text'

When I inspect batch instead of batch["text"], I get an empty dict {}, which is not expected.

However:

train_dataset["train"][:3]

works as expected outside the Trainer class.
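A quick way to confirm what the transform actually receives (the debug print below is just an illustration I added, not part of my real code) is to log the batch keys inside encode:

def encode(batch):
    # Debug: show which columns reach the transform. Through the Trainer this
    # prints an empty list, while direct indexing prints ['text'].
    print(list(batch.keys()))
    return tokenizer(batch["text"], truncation=True, max_length=tokenizer.model_max_length)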

ah i think that’s because the trainer removes unused columns like text by default (see docs). what happens if you set remove_unused_columns=False in your TrainingArguments?
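concretely, that would look something like this (the output_dir value is just a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",  # placeholder
    # keep columns like "text" that the model's forward() doesn't accept,
    # so the on-the-fly transform can still see them
    remove_unused_columns=False,
)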

fyi there is a related issue here that might contain additional info for your use case: ERROR WHEN USING SET_TRANSFORM() · Issue #1867 · huggingface/datasets · GitHub


Thank you, it’s working now.

For those wondering how to train on a large dataset with on-the-fly tokenization and a line-by-line .txt setup 🙂 :

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

model = ...
training_args = ...
max_length = ...
tokenizer = ...
txt_dataset = ...

# Tokenize lazily: the transform runs each time a batch is accessed,
# so nothing extra is written to disk and RAM usage stays low.
def encode(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=max_length)

train_dataset = load_dataset('text', data_files={'train': txt_dataset})
train_dataset.set_transform(encode)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Key point: keep the raw "text" column so the transform can still see it.
training_args.remove_unused_columns = False

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset["train"],
    compute_metrics=...,
)
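From there, training starts as usual:

trainer.train()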