Hello!
For a course I’m teaching, I’m creating some examples of how to fine-tune models for language generation. I’m starting off by following the excellent fine-tune a language model colab notebook.
It works great for me; however, the particular use case I’d like to demonstrate involves a dataset of single sentences / short paragraphs that are unrelated to each other. So rather than concatenating them arbitrarily into one seamless stream of text divided into blocks, I’d like the training process to learn/maintain the start and end of each training sample. I’m investigating rewriting the group_texts() function below, but I’m wondering if anyone has any helpful tips or pointers, or if there is an example that already works this way?
Researching this a bit, I see that the huggingtweets project adds markers into the dataset:
<|endoftext|>This is my first tweet!<|endoftext|>Second tweet already!<|endoftext|>
Is this the standard approach? Or is coming up with a maximum length for each sample and adding “padding” another viable approach?
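To make the marker idea concrete, here’s a tiny sketch of what I have in mind: append an end-of-text id to every sample before concatenating, so each block still carries “this sample ended here” signals. The toy word-level tokenizer, the eos_id value, and the small block_size are just placeholders, not the real tokenizer:

```python
# Sketch of the marker approach: append an end-of-text id per sample, then
# concatenate and chunk exactly like group_texts() does. eos_id and
# block_size are placeholder values for illustration only.
eos_id = 50256   # e.g. GPT-2's <|endoftext|> id (assumed)
block_size = 3   # tiny block size just to show the chunking

def tokenize_with_eos(samples, vocab):
    """Toy word-level 'tokenizer' that appends eos_id to each sample."""
    return [[vocab[w] for w in s.split()] + [eos_id] for s in samples]

vocab = {"first": 1, "tweet": 2, "second": 3, "one": 4}
ids = tokenize_with_eos(["first tweet", "second one"], vocab)

# Flatten all samples, drop the remainder, and split into blocks.
flat = [t for sample in ids for t in sample]
usable = (len(flat) // block_size) * block_size
blocks = [flat[i:i + block_size] for i in range(0, usable, block_size)]
# blocks -> [[1, 2, 50256], [3, 4, 50256]]
```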
Thank you!
Dan
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead if the model
    # supported it. You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
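And here is a rough sketch of the padding alternative I’m asking about: a drop-in replacement for group_texts() that keeps each sample separate and pads (or truncates) it to block_size, masking the loss on pad positions with -100. The pad_id and block_size values are placeholders, and this assumes the examples have already been tokenized into "input_ids":

```python
# Hypothetical alternative to group_texts(): keep samples separate and
# pad/truncate each one to block_size. pad_id and block_size are assumed
# placeholder values; labels use -100 on padding so the loss ignores it.
pad_id = 0
block_size = 5

def pad_texts(examples):
    input_ids, attention_mask, labels = [], [], []
    for ids in examples["input_ids"]:
        ids = ids[:block_size]                  # truncate overlong samples
        n_pad = block_size - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        labels.append(ids + [-100] * n_pad)     # mask loss on padding
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

batch = pad_texts({"input_ids": [[7, 8], [1, 2, 3, 4, 5, 6]]})
# batch["input_ids"] -> [[7, 8, 0, 0, 0], [1, 2, 3, 4, 5]]
# batch["labels"]    -> [[7, 8, -100, -100, -100], [1, 2, 3, 4, 5]]
```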