Retain start and end of training samples for fine-tuning language modeling

Hello!

For a course I’m teaching, I’m creating some examples of how to fine-tune models for language generation. I’m starting off by following the excellent “Fine-tune a language model” Colab notebook.

It works great for me. However, the particular use case I’d like to demonstrate involves a dataset of single sentences / short paragraphs that are unrelated to each other. So rather than concatenating them arbitrarily into one seamless stream of text divided into blocks, I’d like the training process to learn/maintain the start and end of each training sample. I’m investigating rewriting the group_texts() function below, but I’m wondering if anyone has helpful tips or pointers, or if there is already an example that works this way?

Researching this a bit, I see that the huggingtweets project adds markers into the dataset:

<|endoftext|>This is my first tweet!<|endoftext|>Second tweet already!<|endoftext|>

Is this the standard approach? Or is coming up with a maximum length for each sample and adding “padding” another viable approach?
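
For reference, here is a rough, untested sketch of the kind of change I have in mind, reusing the tokenizer and datasets objects from the notebook and assuming a GPT-2-style tokenizer whose eos_token is <|endoftext|>:

def tokenize_function(examples):
    # Append the tokenizer's end-of-text token to every sample so that
    # group_texts() keeps a boundary marker between samples when it
    # concatenates them into blocks.
    return tokenizer([text + tokenizer.eos_token for text in examples["text"]])

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)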

Thank you!
Dan

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding if the model supported it
    # instead of this drop. You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Would it be possible for you to add <|endoftext|> to your examples before they reach group_texts() and avoid the need to modify the function?

For example, I currently have all my input examples for fine-tuning as separate text files. A simple script compiles them into train and validation files. That script handles prepending <|endoftext|> to the first example and appending it to every example.
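
Something along these lines (a simplified sketch; the paths and the 90/10 split are placeholders, not my actual script):

from pathlib import Path

EOT = "<|endoftext|>"

# Read each example from its own text file.
examples = [p.read_text().strip() for p in sorted(Path("examples").glob("*.txt"))]

# Prepend <|endoftext|> to the first example and append it to every example,
# so the compiled files look like: <|endoftext|>ex 1<|endoftext|>ex 2<|endoftext|>
split = int(len(examples) * 0.9)
Path("train.txt").write_text(EOT + EOT.join(examples[:split]) + EOT)
Path("validation.txt").write_text(EOT + EOT.join(examples[split:]) + EOT)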

That said, I have some related questions about <|endoftext|>.

I’m not sure I fully understand the mechanics of <|endoftext|>. Does something happen internally to mask out everything before/after an <|endoftext|> token, like an attention mask? Or is it just a token like any other that’s convenient to use for stopping when the model produces it during generation?

Suppose I were going to fine-tune on a bunch of fairy tales that all start with “Once upon a time” and end with “The end.”

Is pasting one <|endoftext|> between the end of one fairy tale and the beginning of the next enough to prevent a causal LM head from always predicting “Once upon a time” after seeing “The end.”?

Would that same <|endoftext|> token prevent an open-domain question-answering model from identifying the three bears as a candidate answer for “In whose house did the princess stay?” if Goldilocks and Snow White happened to end up sharing a block?

Or would it be better to pad all examples, set the block size to the padded length, and use attention masks to ignore the padding?
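
For concreteness, here is roughly what I mean by that padding alternative (just a sketch; I’m assuming block_size as the padded length and the usual -100 label value that the loss ignores):

# GPT-2's tokenizer has no pad token by default, so something like
# tokenizer.pad_token = tokenizer.eos_token would be needed first.
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=block_size,       # block size == padded length
        padding="max_length",
    )
    # Copy input_ids to labels, but mark padded positions with -100 so the
    # loss ignores them (the attention mask already hides them from attention).
    tokenized["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]
    return tokenized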

I think this is very similar to your question, and I hope that any help anyone offers either of us will benefit us both.
