Dear community,
I am trying to build a dataset for subsequent GPT-2 pretraining of a generative language model. My examples are very long, and I was thinking of using the map method to cut them into sequences of 1024 tokens with this function:
def chunk_examples(examples):
    chunks = []
    for sentence in examples["sentence1"]:
        # split each text into fixed-length slices (50 characters here)
        chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
    return {"chunks": chunks}
However, I am not sure what the best practice is. Should I tokenize the texts first and then map a chunking function over the dataset, doing this kind of “data augmentation” to get more examples of the suitable size (1024 tokens)?
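For instance, would something along these lines be the right direction? This is only a rough sketch of what I mean, assuming the GPT-2 tokenizer from transformers and a text column named "sentence1"; block_size matches GPT-2's 1024-token context.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
block_size = 1024  # GPT-2 context length

def tokenize_and_chunk(examples):
    # tokenize each long text, then split the token ids into 1024-token blocks
    # (the tokenizer may warn that the text is longer than the model max length,
    # but we split it ourselves right after)
    input_chunks = []
    for text in examples["sentence1"]:
        ids = tokenizer(text)["input_ids"]
        input_chunks += [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
    return {"input_ids": input_chunks}

tokenized_chunks = raw_dataset.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=raw_dataset.column_names,
)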
I can't find any examples of this kind of practice, either with or without Hugging Face's Datasets package.