Using Dataset.set_transform() along with the Trainer class

Hello,

I’m trying to use the set_transform(…) method of a dataset returned by load_dataset, together with DataCollatorForLanguageModeling and the Trainer class from the transformers library, to pretrain a model. My dataset is large: tokenizing it all up front does not fit in RAM, and pre-tokenizing with .map() uses far too much disk space (> 500 GB), which is limited in my case. So I need to tokenize on the fly.
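(For context, the eager approach I’m trying to avoid looks roughly like the sketch below; it reuses the encode function and txt_train_dataset placeholder from my code further down and is only illustrative.)

from datasets import load_dataset

# Eager tokenization with .map(): every tokenized example is written to the
# Arrow cache on disk, which is what exceeds my disk budget for a large corpus.
raw_dataset = load_dataset('text', data_files={'train': txt_train_dataset})
tokenized_dataset = raw_dataset.map(encode, batched=True)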

While set_transform works as expected when I index the dataset directly, it fails when I plug it into a DataCollatorForLanguageModeling and a Trainer, and I don’t know why.

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

tokenizer = ...

def encode(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")

train_dataset = load_dataset('text', data_files={'train': txt_train_dataset})
train_dataset.set_transform(encode)

validation_dataset = load_dataset('text', data_files={'validation': txt_validation_dataset})
validation_dataset.set_transform(encode)

print(train_dataset["train"][:3])  # works as expected: {'input_ids': ..., 'attention_mask': ...}

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=...,
    args=...,
    data_collator=data_collator,
    train_dataset=train_dataset["train"],
    eval_dataset=validation_dataset["validation"],
    compute_metrics=...,
)

I’m probably missing something.


hey @tkon3, when you say “it fails” what do you mean exactly? there’s a related thread on set_transform here that might be useful: Understanding set_transform

There is a type change, and the Trainer class fails to loop over it.

File "/home/transformers_datasets_test/model/trainer.py", line 88, in encode
        return tokenizer(batch["text"], truncation=True, max_length=tokenizer.model_max_length)
    KeyError: 'text'

When I inspect batch instead of batch["text"], I get an empty dict {}, which is not expected.

However:

train_dataset["train"][:3]

works as expected outside the Trainer class.
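A quick way to confirm what the transform actually receives (the debug print below is just an illustration I added, not part of my real code) is to log the batch keys inside encode:

def encode(batch):
    # Debug: show which columns reach the transform. Through the Trainer this
    # prints an empty list, while direct indexing prints ['text'].
    print(list(batch.keys()))
    return tokenizer(batch["text"], truncation=True, max_length=tokenizer.model_max_length)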

ah i think that’s because the trainer removes unused columns like text by default (see docs). what happens if you set remove_unused_columns=False in your TrainingArguments?
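concretely, that would look something like this (the output_dir value is just a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",  # placeholder
    # keep columns like "text" that the model's forward() doesn't accept,
    # so the on-the-fly transform can still see them
    remove_unused_columns=False,
)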

fyi there is a related issue here that might contain additional info for your use case: ERROR WHEN USING SET_TRANSFORM() · Issue #1867 · huggingface/datasets · GitHub


Thank you, it’s working now.

For those wondering how to train on a large dataset with on-the-fly tokenization and a line-by-line .txt setup 🙂 :

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

model = ...
training_args = ...
max_length = ...
tokenizer = ...
txt_dataset = ...

# Tokenize lazily: the transform runs each time a batch is accessed,
# so nothing extra is written to disk and RAM usage stays low.
def encode(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=max_length)

train_dataset = load_dataset('text', data_files={'train': txt_dataset})
train_dataset.set_transform(encode)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Key point: keep the raw "text" column so the transform can still see it.
training_args.remove_unused_columns = False

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset["train"],
    compute_metrics=...,
)
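From there, training starts as usual:

trainer.train()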