My question is about padding and batching. For context, I am training models on the CausalLM task (although I have had a similar question with other tasks). My dataset has sequences that vary quite a bit in length, from just a few tokens up to sequences so long that I have to truncate them to the model's maximum length.
My code is a bit more involved, but the basic idea is as follows:
from datasets import load_from_disk
from transformers import AutoTokenizer
dataset_location = "path_to_dataset"
dataset = load_from_disk(dataset_location)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b-deduped")
tokenizer.pad_token = tokenizer.eos_token
def preprocess(example):
    # Pads every example out to 2048 tokens and truncates anything longer
    return tokenizer(example['text'], max_length=2048, padding='max_length', truncation=True)
tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=['text'])
By padding everything to the maximum length, I am creating a very large dataset, and every batch is always as large as the model allows. I believe this inflates both my training time and my memory requirements.
So, my main question: is there a way to (and should I) create batches whose padded length varies with the longest sequence in each batch? If so, how do I do that? And if I pre-build those batches (including the truncation), how do I make sure those same groups of sequences stay together as the batches that actually get trained?
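For reference, here is roughly what I imagine based on my reading of the transformers docs (the specific collator and arguments are my assumption of how dynamic padding works, not something I have verified): tokenize with truncation only, and let a data collator pad each batch to its own longest sequence at collation time.

from datasets import load_from_disk
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

dataset = load_from_disk("path_to_dataset")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b-deduped")
tokenizer.pad_token = tokenizer.eos_token

def preprocess(examples):
    # Truncate to the model's max length, but do NOT pad here;
    # each example keeps its natural length.
    return tokenizer(examples["text"], max_length=2048, truncation=True)

tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=["text"])

# mlm=False produces causal-LM labels; the collator pads each batch
# to the length of its longest sequence when the batch is assembled.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)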
So far I have worked around this by sorting my documents by sequence length, manually building batches with a padded length that is reasonable for each group, and then writing my own training loop. That seems excessive, and I am trying to use Trainer instead.
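For completeness, here is a rough sketch of how I think this would look with Trainer, reusing tokenized_dataset and data_collator from the sketch above. group_by_length is the option I believe samples similar-length sequences into the same batch, but I may be misunderstanding it, and the other argument values are just placeholders.

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b-deduped")

training_args = TrainingArguments(
    output_dir="out",                 # placeholder output path
    per_device_train_batch_size=8,    # placeholder batch size
    group_by_length=True,             # group similar-length sequences into the same batch
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,  # tokenized without padding, as in the sketch above
    data_collator=data_collator,      # pads each batch to its own max length
)
trainer.train()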