Efficient bucketing implementation

What is the most efficient way to create batches with sequences of similar length to minimize padding when working with HF datasets? Is it just a matter of calling torchtext's BucketIterator (torchtext.data — torchtext 0.4.0 documentation)? Is there any reference implementation?


Hello,

Here is one option to handle your use case; you could also bucketize the training examples instead.

  1. Select the tokenizer you wish to use with the dataset, for example the pre-trained GPT-2 one.
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

  2. Load a dataset of your choosing.
    from datasets import load_dataset

    load_data = load_dataset("wikitext", "wikitext-2-v1", split="train")

  3. Concatenate the tokenized input examples together and then split them into sequences of exactly 512 tokens. The last chunk will be shorter than 512 tokens, so you will need to filter it out or pad it. The sequence length is arbitrary and can be chosen depending on the application. Make sure that the tokenizer you initially select is not limited to a shorter sequence length; otherwise, you may get a warning. You can check the configuration file provided when downloading it.

from itertools import chain

def tokenize(examples):
    seq_length = 512
    examples = tokenizer(examples["text"])
    # Concatenate every tokenized field (input_ids, attention_mask, ...) into one long list.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the remainder so the total length is an exact multiple of seq_length.
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length
    # Split the concatenated lists into chunks of exactly seq_length tokens.
    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result
  4. Map the tokenize function to the loaded dataset. Remove the original text column so that only the tokenized columns (input_ids, attention_mask, etc.) remain.
    tokenized_dataset = load_data.map(tokenize, batched=True, remove_columns=["text"])

  5. If you have not already filtered or padded the last batch, use drop_last=True to remove it. Load your tokenized dataset into the PyTorch DataLoader. Use the default collate function or define your own. Select a batch size that fits into memory; longer tokenized sequence lengths such as 1024 will take up a lot more memory.
    from torch.utils.data import DataLoader
    from transformers import default_data_collator

    train_dataloader = DataLoader(tokenized_dataset, shuffle=True, drop_last=True, collate_fn=default_data_collator, batch_size=8)

  6. Loop through your DataLoader as you normally would when training.
    for step, batch in enumerate(train_dataloader):
        ...  # forward pass, loss, backward, optimizer step
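
If it helps, a bare-bones version of such a training loop might look like this. This is a sketch only; the GPT2LMHeadModel and AdamW choices are illustrative and not part of the recipe above.

    from torch.optim import AdamW
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = AdamW(model.parameters(), lr=5e-5)
    model.train()

    for step, batch in enumerate(train_dataloader):
        # Passing input_ids as labels makes the model compute the causal LM loss internally.
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()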

You can check the length of the sequences in the training batches to verify this, if you wish.
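
For example, a quick sanity check over the first few batches (with this setup every batch should come out as [batch_size, 512]):

    # Inspect the shape of input_ids for a handful of batches.
    for step, batch in enumerate(train_dataloader):
        print(batch["input_ids"].shape)
        if step == 2:
            break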

Hope this can help!

Hi @conceptofmind, thanks for your detailed answer. I have in mind a seq2seq use case, not an LM one, which is why bucketing is so important in this case.

Hi @jordiae ,

These articles on building a PyTorch Text Bucket Iterator with sequences of similar length and dynamic padding may be useful:

Here are other sources with a similar use case as well:
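
As a rough illustration of the bucket-iterator idea in plain PyTorch, here is a sketch (not taken from the linked articles): make_length_buckets and pad_collate are hypothetical helper names, it reuses the tokenizer from the first reply, and it assumes a tokenized dataset whose input_ids/attention_mask rows keep their natural, variable lengths (i.e. tokenized without the concatenation step above).

    import random
    import torch
    from torch.utils.data import DataLoader

    def make_length_buckets(lengths, batch_size, bucket_factor=100, seed=0):
        # Shuffle the example indices, then sort inside large "mega-batches"
        # so that each yielded batch contains examples of similar length.
        rng = random.Random(seed)
        indices = list(range(len(lengths)))
        rng.shuffle(indices)
        mega = batch_size * bucket_factor
        batches = []
        for start in range(0, len(indices), mega):
            chunk = sorted(indices[start:start + mega], key=lambda i: lengths[i])
            batches.extend(chunk[j:j + batch_size] for j in range(0, len(chunk), batch_size))
        rng.shuffle(batches)  # batch order stays random, batch contents stay homogeneous
        return batches

    def pad_collate(features):
        # Dynamic padding: pad only to the longest sequence in this batch.
        pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask = [], []
        for f in features:
            pad = max_len - len(f["input_ids"])
            input_ids.append(f["input_ids"] + [pad_id] * pad)
            attention_mask.append(f["attention_mask"] + [0] * pad)
        return {"input_ids": torch.tensor(input_ids),
                "attention_mask": torch.tensor(attention_mask)}

    lengths = [len(ids) for ids in tokenized_dataset["input_ids"]]
    bucketed_dataloader = DataLoader(tokenized_dataset,
                                     batch_sampler=make_length_buckets(lengths, batch_size=8),
                                     collate_fn=pad_collate)

If you end up using the transformers Trainer instead of a manual loop, its group_by_length training argument implements a similar grouping of samples by length.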

You can also sort and filter the sequences by length and then use the .map() function to pad or truncate the resulting batches, as sketched below.
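
Here is a minimal sketch of that sort-and-pad approach with the datasets library. It assumes the load_data and tokenizer from earlier in the thread, a tokenizer that has a pad token, and an arbitrary max_len cutoff:

    max_len = 128  # arbitrary cutoff for this example

    def add_length(example):
        example["length"] = len(tokenizer(example["text"])["input_ids"])
        return example

    ds = load_data.map(add_length)
    ds = ds.filter(lambda ex: 0 < ex["length"] <= max_len)  # drop empty and overlong rows
    ds = ds.sort("length")  # neighbouring rows now have similar lengths

    def tokenize_and_pad(examples):
        # Because the rows are sorted, each mapped batch needs only a little padding.
        return tokenizer(examples["text"], padding=True, truncation=True, max_length=max_len)

    ds = ds.map(tokenize_and_pad, batched=True, remove_columns=["text", "length"])

Since the padding happens per mapped batch, you would then want to keep rows of similar length together when batching for training (for example with a batch sampler like the one sketched above) rather than shuffling them freely.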

Hopefully one of these meets your criteria.


Thank you!