Building a GPT2 dataset from long sequences

Dear community,

I am trying to build a dataset for subsequent GPT2 pretraining of a language generation model. I have very long examples, and I was thinking of using the map method to cut them into sequences of 1024 tokens with this function:

def chunk_examples(examples):
    chunks = []
    for sentence in examples["sentence1"]:
        # Note: this slices the raw string into 50-character chunks, not token chunks.
        chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
    return {"chunks": chunks}

However, I am not sure what the best practice would be. Should I tokenize the texts first and then map the chunking function over the dataset, to do this kind of “data augmentation” and get more examples of the suitable size (1024 tokens)?

I can't find any examples of this kind of practice, either with or without Hugging Face's Datasets package.

Hi ! You should tokenize before chunking the examples, otherwise you can’t control how many tokens are present in each chunked example.
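For example, something along these lines (a minimal sketch, assuming a GPT2 tokenizer from transformers and that your dataset has the "sentence1" column from your snippet; the names tokenize_and_chunk, block_size, and chunked_dataset are just illustrative):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
block_size = 1024  # GPT2 context length

def tokenize_and_chunk(examples):
    # Tokenize first, so chunk boundaries are counted in tokens rather than characters.
    tokenized = tokenizer(examples["sentence1"])["input_ids"]
    chunks = []
    for ids in tokenized:
        # Split each long token sequence into blocks of at most block_size tokens.
        chunks += [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
    return {"input_ids": chunks}

# batched=True lets one long input example yield several chunked examples;
# remove_columns drops the original columns, since the number of rows changes.
chunked_dataset = dataset.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=dataset.column_names,
)

This way every resulting example is at most 1024 tokens long, which you can't guarantee when slicing the raw strings by character count.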