Retain start and end of training samples for fine-tuning language modeling


For a course I’m teaching I’m creating some examples on how to fine-tune models for language generation. I’m starting off following the excellent fine-tune a language model colab notebook.

It works great for me. However, the particular use case I’d like to demonstrate involves a dataset of single sentences or short paragraphs that are unrelated to each other. So rather than concatenating them arbitrarily into one seamless stream of text divided into blocks, I’d like the training process to learn and maintain the start and end of each training sample. I’m investigating rewriting the group_texts() function below, but I’m wondering if anyone has any helpful tips or pointers, or if there is an example that already works this way.

Researching this a bit, I see that the huggingtweets project adds markers to the dataset:

<|endoftext|>This is my first tweet!<|endoftext|>Second tweet already!<|endoftext|>

Is this the standard approach? Or is coming up with a maximum length for each sample and adding “padding” another viable approach?

Thank you!

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding if the model supported it instead of this drop.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Would it be possible for you to add <|endoftext|> to your examples before they reach group_texts() and avoid the need to modify the function?
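To make that suggestion concrete, here is a minimal sketch that works on toy token ids rather than a real tokenizer. The add_eos() helper, the toy data, and the tiny block_size are made up for illustration; EOS_ID is set to 50256, which is the id GPT-2's tokenizer uses for <|endoftext|>, but with a real tokenizer you would read it from tokenizer.eos_token_id instead of hard-coding it.

```python
block_size = 4    # tiny value so the resulting blocks are easy to read
EOS_ID = 50256    # assumption: GPT-2's id for <|endoftext|>; other tokenizers differ


def add_eos(examples):
    """Hypothetical helper: append the end-of-text id to every tokenized sample."""
    return {
        "input_ids": [ids + [EOS_ID] for ids in examples["input_ids"]],
        "attention_mask": [mask + [1] for mask in examples["attention_mask"]],
    }


def group_texts(examples):
    # Unchanged grouping logic: concatenate, drop the remainder, split into blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three toy "sentences" that are already tokenized.
toy = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}

blocks = group_texts(add_eos(toy))
print(blocks["input_ids"])
# [[1, 2, 3, 50256], [4, 5, 50256, 6], [7, 8, 9, 50256]]
```

Note that the second block still contains pieces of two different samples. The marker sits between them as an ordinary token, and nothing in this sketch masks attention across it.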

For example, I currently have all my input examples for fine-tuning as separate text files. A simple script compiles them into train and validation files. That script handles prepending <|endoftext|> to the first example and appending it to every example.
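A rough sketch of the core of such a script, not the exact one I use: compile_examples() is a hypothetical name, and it takes the example texts as a list of strings, so reading the individual files and writing the train and validation files is left out.

```python
EOS = "<|endoftext|>"


def compile_examples(texts):
    """Join examples into one string, with EOS before the first and after every example."""
    cleaned = [t.strip() for t in texts if t.strip()]  # skip empty examples
    if not cleaned:
        return ""
    return EOS + "".join(t + EOS for t in cleaned)


corpus = compile_examples(["This is my first tweet!", "Second tweet already!"])
print(corpus)
# <|endoftext|>This is my first tweet!<|endoftext|>Second tweet already!<|endoftext|>
```

This produces the same layout as the huggingtweets example quoted above.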

That said, I have some related questions about <|endoftext|>.

I’m not sure I fully understand the mechanics of <|endoftext|>. Does something happen internally to mask out everything before/after an <|endoftext|> token, like an attention mask? Or is it just a token like any other that’s convenient to use for stopping when the model produces it during generation?

Suppose I were going to fine-tune on a bunch of fairy tales that all start with “Once upon a time” and end with “The end.”

Is pasting one <|endoftext|> between the end of one fairy tale and the beginning of the next enough to prevent a causal LM head from always predicting “Once upon a time” after seeing “The end.”?

Would that same <|endoftext|> token prevent an open-domain question-answering model from identifying the three bears as a candidate answer for “In whose house did the princess stay?” if Goldilocks and Snow White happened to end up sharing a block?

Or would it be better to pad all examples, set the block size to the padded length, and use attention masks to ignore the padding?
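For comparison, the padding alternative might look roughly like the sketch below, again on plain lists of token ids. pad_examples() and its arguments are hypothetical names. The one fixed value is -100, which is the label PyTorch's CrossEntropyLoss ignores by default, so padded positions would not contribute to the loss. Samples longer than block_size are simply truncated here, which is an assumption and may not suit every dataset.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def pad_examples(token_lists, block_size, pad_id):
    """Hypothetical helper: one sample per block, padded (or truncated) to block_size."""
    input_ids, attention_mask, labels = [], [], []
    for ids in token_lists:
        ids = ids[:block_size]                 # truncate samples that are too long
        n_pad = block_size - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels are masked by position, not by token value, so a real
        # end-of-text token keeps its label even if pad_id is the same id.
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


batch = pad_examples([[1, 2, 3], [4, 5, 6, 7, 8]], block_size=4, pad_id=0)
print(batch["input_ids"])       # [[1, 2, 3, 0], [4, 5, 6, 7]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
print(batch["labels"])          # [[1, 2, 3, -100], [4, 5, 6, 7]]
```

With this layout no block ever mixes two samples, at the cost of spending part of every block on padding when the samples are short.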

I think this is very similar to your question, and I hope that if anyone is willing to help either of us, it will benefit us both.
