Make text data continuous from DatasetDict

PatrickHaller · May 11, 2022, 9:59am

Hello!

I have a question about processing datasets, and hope to find some opinions here.
Given a dataset,

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 659
    })
   ...
})

which contains text of varying length in each row, I want to concatenate the data from all rows and then divide it evenly, so that in the end every row has the same length.

Is there a “built-in” way to do this?

Greetings,
Patrick

conceptofmind · May 11, 2022, 10:48pm

Hello Patrick,

This is one option to handle your use case.

Select the tokenizer you wish to use with the dataset, for example the pre-trained GPT-2 one.
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
Load a dataset of your choosing.
from datasets import load_dataset
load_data = load_dataset("wikitext", "wikitext-2-v1", split="train")
Concatenate the tokenized input examples and then split them into sequences of exactly 512 tokens. Because map processes the data in batches, the concatenation happens per batch: leftover tokens at the end of a batch are dropped, and a batch with fewer than 512 tokens in total comes back as one shorter sequence, which you will need to filter or pad. The sequence length is arbitrary and can be chosen depending on the application. Make sure that the tokenizer you initially select is not limited to a specific sequence length; otherwise, you may get a warning. You can check the configuration file provided when downloading it.

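from itertools import chain

def tokenize(examples):
    seq_length = 512
    # Tokenize every text in the batch.
    examples = tokenizer(examples["text"])
    # Flatten each field (input_ids, attention_mask, ...) into one long list.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the remainder so the total is a multiple of seq_length.
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length
    # Cut each field into chunks of seq_length tokens.
    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result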
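
To see what the concatenate-and-split step does without downloading a tokenizer or dataset, here is a minimal sketch of the same logic on a toy batch of already-tokenized rows. The helper name group_into_chunks and the toy token ids are made up for illustration; only the chunking logic mirrors the function above.

```python
from itertools import chain


def group_into_chunks(batch, seq_length):
    """Concatenate every column of a batch and cut it into seq_length pieces."""
    # Flatten each column (a list of lists) into one long list.
    concatenated = {k: list(chain(*v)) for k, v in batch.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so the total is a multiple of seq_length.
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length
    return {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated.items()
    }


# Three rows of different lengths (10 tokens in total), hypothetical ids.
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
chunks = group_into_chunks(toy_batch, seq_length=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]

# A batch shorter than seq_length comes back as one short chunk.
short = group_into_chunks({"input_ids": [[1, 2]]}, seq_length=4)
print(short["input_ids"])  # [[1, 2]]
```

Note that tokens 9 and 10 are dropped in the first case, and row boundaries are ignored, so a chunk can start in the middle of one original text and end in the middle of another.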

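Map the tokenize function over the loaded dataset. Remove the original text column so that only input_ids, attention_mask, etc. remain; this is needed because the function returns a different number of rows than it receives.
tokenized_dataset = load_data.map(tokenize, batched=True, remove_columns=["text"])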
Filter out any examples which do not have a sequence length of exactly 512. You can also pad them or set drop_last=True in the PyTorch DataLoader.
filtered_dataset = tokenized_dataset.filter(lambda x: len(x["input_ids"]) == 512)
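
If you would rather pad a short sequence than drop it, a rough sketch could look like the following. The helper pad_to_length and the pad id 0 are assumptions for illustration only; the GPT-2 tokenizer does not define a padding token by default, so you would have to choose one yourself.

```python
def pad_to_length(input_ids, seq_length, pad_id):
    """Right-pad one sequence and build the matching attention mask."""
    n_pad = seq_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than seq_length")
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


# pad_id=0 is a placeholder, not a real GPT-2 padding id.
padded = pad_to_length([9, 10], seq_length=4, pad_id=0)
print(padded["input_ids"])       # [9, 10, 0, 0]
print(padded["attention_mask"])  # [1, 1, 0, 0]
```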

You can check the length of the sequences in the dataset to make sure they are all 512.
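
One simple way to do that check is to count how many sequences there are of each length. The sketch below runs on a toy list; the helper name length_histogram is made up, and on the real data you would presumably pass something like filtered_dataset["input_ids"] instead.

```python
from collections import Counter


def length_histogram(sequences):
    """Count how many sequences there are of each length."""
    return Counter(len(seq) for seq in sequences)


# Toy stand-in for a column of token id lists.
toy_sequences = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
hist = length_histogram(toy_sequences)
print(hist)  # Counter({4: 2, 2: 1})
```

If everything worked, the result for the filtered dataset should contain a single key, 512.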

Hope this can help!
