Efficient bucketing implementation

What is the most efficient way to create batches of sequences with similar lengths, to minimize padding, in HF datasets? Just calling torchtext's BucketIterator (see torchtext.data in the torchtext 0.4.0 documentation)? Is there a reference implementation?


This is one option for your use case. You can also bucketize the training examples.

  1. Select the tokenizer you wish to use with the dataset. For example, the pre-trained GPT-2 one.
    from transformers import GPT2Tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

  2. Load a dataset of your choosing.
    from datasets import load_dataset
    load_data = load_dataset("wikitext", "wikitext-2-v1", split="train")

  3. Concatenate the tokenized examples and split them into sequences of exactly 512 tokens. Any leftover chunk shorter than 512 tokens has to be filtered out or padded. The sequence length is arbitrary and can be chosen to suit the application. Make sure the tokenizer you select is not limited to a specific maximum sequence length; otherwise, you may get a warning. You can check this in the configuration file provided when downloading it.

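    from itertools import chain

    def tokenize(examples):
        seq_length = 512
        # Tokenize every text in the batch
        examples = tokenizer(examples["text"])
        # Flatten each field (input_ids, attention_mask, ...) into one long list
        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # Drop the remainder so the total is a multiple of seq_length
        if total_length >= seq_length:
            total_length = (total_length // seq_length) * seq_length
        # Split each field into chunks of seq_length tokens
        result = {
            k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
            for k, t in concatenated_examples.items()
        }
        return result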

  4. Map the tokenize function over the loaded dataset. Remove the original columns so that only input_ids, attention_mask, etc. remain.
    tokenized_dataset = load_data.map(tokenize, batched=True, remove_columns=["text"])

  5. Load your tokenized dataset into the PyTorch DataLoader (torch.utils.data.DataLoader). Use the default collate function (here default_data_collator from transformers) or define your own. Setting drop_last=True drops a final batch that has fewer than batch_size examples; it does not remove a sequence that is shorter than the others, so filter or pad such sequences beforehand. Select a batch size that fits into memory. Longer tokenized sequence lengths such as 1024 will take up a lot more memory.
    train_dataloader = DataLoader(tokenized_dataset, shuffle=True, drop_last=True, collate_fn=default_data_collator, batch_size=8)

  6. Loop through your DataLoader as you normally would when training.
    for step, batch in enumerate(train_dataloader):
        ...  # your training step goes here

You can check the length of the sequences in the training batches to confirm this if you wish.
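One way to do that check is sketched below. The helper name `sequence_lengths` and the toy batches are only for illustration and are not part of the steps above; the assumption is that each batch is a dict whose `input_ids` entry can be iterated row by row, which holds for lists of lists and should also hold for the 2-D tensors a DataLoader typically yields.

```python
def sequence_lengths(batches, key="input_ids"):
    """Collect the distinct sequence lengths seen across all batches."""
    lengths = set()
    for batch in batches:
        for seq in batch[key]:
            lengths.add(len(seq))
    return lengths


# Stand-in batches (hypothetical data); in practice you would pass
# train_dataloader instead of this list.
toy_batches = [
    {"input_ids": [[1] * 512, [2] * 512]},
    {"input_ids": [[3] * 512, [4] * 512]},
]
print(sequence_lengths(toy_batches))  # {512}
```

If the result contains anything other than the chosen sequence length, some chunk was not filtered or padded.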

Hope this can help!

Hi @conceptofmind, thanks for your detailed answer. I have in mind a seq2seq use case, not an LM one, which is why bucketing is so important here.

Hi @jordiae,

These articles on building a PyTorch Text Bucket Iterator with sequences of similar length and dynamic padding may be useful:

Here are other sources with a similar use case as well:

You can also sort and filter sequences by length and then use the .map() function to pad or truncate the rest of the batches.
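A minimal sketch of the sort-by-length idea, in plain Python so it does not depend on any particular library version. The names `bucket_batches`, `pad_batch`, and `bucket_mult` are hypothetical, not an existing API. The assumption is that you already have the token length of every example; indices are shuffled, cut into pools, sorted by length inside each pool, and sliced into batches, so each batch only needs padding up to its own longest sequence.

```python
import random


def bucket_batches(lengths, batch_size, bucket_mult=50, shuffle=True, seed=0):
    """Return lists of indices whose sequences have similar lengths.

    Indices are (optionally) shuffled, cut into pools of
    batch_size * bucket_mult, sorted by length inside each pool,
    then sliced into batches of batch_size.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    if shuffle:
        rng.shuffle(indices)
    pool_size = batch_size * bucket_mult
    batches = []
    for start in range(0, len(indices), pool_size):
        pool = sorted(indices[start:start + pool_size], key=lambda i: lengths[i])
        batches.extend(pool[j:j + batch_size] for j in range(0, len(pool), batch_size))
    if shuffle:
        # Shuffle batch order so training does not go from short to long
        rng.shuffle(batches)
    return batches


def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


# Toy data (hypothetical): eight sequences of varying length
data = [[1] * n for n in (5, 2, 9, 3, 8, 2, 4, 9)]
lengths = [len(s) for s in data]
for idx in bucket_batches(lengths, batch_size=2, bucket_mult=4, shuffle=False):
    print(pad_batch([data[i] for i in idx]))
```

The list of index batches can then be handed to a PyTorch DataLoader through its `batch_sampler` argument, with the padding done in a custom `collate_fn`. A larger `bucket_mult` gives tighter length grouping at the cost of less randomness between epochs.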

Hopefully one of these will meet your needs.


Thank you!