Dear community,
I am trying to build a dataset for subsequent GPT-2 pretraining of a generative language model. My examples are very long, and I was thinking of using the map method to cut them into sequences of 1024 tokens with this function:
def chunk_examples(examples):
    chunks = []
    for sentence in examples["sentence1"]:
        # split each text into fixed-length slices (50 characters here)
        chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
    return {"chunks": chunks}
However, I am not sure what the best practice is. Should I tokenize the texts first and then map a chunking function over the dataset, doing this kind of “data augmentation” to get more examples of the suitable size (1024 tokens)?
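For instance, would something along these lines be the right direction? This is only a rough sketch of what I mean, assuming the GPT-2 tokenizer from transformers and a text column named "sentence1"; block_size matches GPT-2's 1024-token context.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
block_size = 1024  # GPT-2 context length

def tokenize_and_chunk(examples):
    # tokenize each long text, then split the token ids into 1024-token blocks
    # (the tokenizer may warn that the text is longer than the model max length,
    # but we split it ourselves right after)
    input_chunks = []
    for text in examples["sentence1"]:
        ids = tokenizer(text)["input_ids"]
        input_chunks += [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
    return {"input_ids": input_chunks}

tokenized_chunks = raw_dataset.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=raw_dataset.column_names,
)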
I can't find any examples of this kind of practice, either with or without Hugging Face's Datasets package.