How can I continue to train my fine-tuned model with new datasets?

I fine-tuned the pre-trained gpt2-medium model with the following code and got good results in evaluation.

import os

from datasets import load_dataset
from tokenizers.processors import TemplateProcessing
from transformers import AutoModelWithLMHead, AutoTokenizer, Trainer, TrainingArguments

path = '/data'

license_numbers = ['00001-1']

for license_number in license_numbers:
    data = load_dataset('json', data_files={
        'train': os.path.join(path, license_number + '_train_company.json'),
        'test': os.path.join(path, license_number + '_test_company.json')
    })

    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    # GPT-2 has no pad token, so reuse the EOS token for padding
    tokenizer.pad_token = tokenizer.eos_token

    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=tokenizer.bos_token + " $A " + tokenizer.eos_token,
        special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id), (tokenizer.bos_token, tokenizer.bos_token_id)]
    )

    tokenized_data = data.map(
        lambda x: tokenizer(x['text']),
        batched=True,
        num_proc=4,
        remove_columns=data["train"].column_names,
    )

    block_size = 1000

    # group_texts: my helper (defined elsewhere) that concatenates the tokenized
    # texts and splits them into chunks of block_size tokens
    lm_dataset = tokenized_data.map(group_texts, batched=True, num_proc=4)

    model = AutoModelWithLMHead.from_pretrained("gpt2-medium")
    model.resize_token_embeddings(len(tokenizer))

    output_dir = f"./models/gpt2_medium_00000_1"

    training_args = TrainingArguments(
        output_dir=output_dir, 
        overwrite_output_dir=True,
        num_train_epochs=20,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        eval_steps=50, 
        save_steps=1600, 
        warmup_steps=10,
        prediction_loss_only=True,
    )

    # MyDataCollator: my custom collator (defined elsewhere); mlm=False gives causal-LM labels
    data_collator = MyDataCollator(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=lm_dataset['train'].shuffle(seed=42),
        eval_dataset=lm_dataset['test'].shuffle(seed=42),
    )

    trainer.train()
    trainer.save_model()
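
By “good results in evaluation” I mean the eval loss / perplexity the Trainer reports on the test split, roughly something like this (a sketch; the exact evaluation code is not part of the snippet above):

import math

# evaluate on the held-out test split and report loss / perplexity
eval_results = trainer.evaluate()
print(f"eval loss: {eval_results['eval_loss']:.4f}")
print(f"perplexity: {math.exp(eval_results['eval_loss']):.2f}")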

I then wanted to continue training the resulting model on new data, to see how much of the previously learned behaviour is lost, i.e. “catastrophic forgetting”. I tried this with the following code (the elided parts are the same as in the first script):

...
license_numbers = ['00001-2']

for license_number in license_numbers:
    data = load_dataset('json', data_files={
        'train': os.path.join(path, license_number + '_train_company.json'),
        'test': os.path.join(path, license_number + '_test_company.json')
    })

    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    ...
    model = AutoModelWithLMHead.from_pretrained("./models/gpt2_medium_00000_1")
    ...
    output_dir = f"./models/gpt2_medium_00000_2"

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        ...
    )
    ...
    trainer.train()
    trainer.save_model()

Now that I am evaluating the new gpt2_medium_00000_2 model, I notice that the results are roughly 90% worse. I don’t think an LLM forgets that much from a single additional fine-tuning run, so I suspect there is a flaw in my implementation.
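
Concretely, the comparison I have in mind is to evaluate both checkpoints on the old (00001-1) test split, roughly like this (a sketch; lm_dataset_old stands for the grouped 00001-1 test data from the first run):

# compare the old and the new checkpoint on the *old* test split to measure forgetting
for checkpoint in ["./models/gpt2_medium_00000_1", "./models/gpt2_medium_00000_2"]:
    model = AutoModelWithLMHead.from_pretrained(checkpoint)
    eval_trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        eval_dataset=lm_dataset_old["test"],
    )
    print(checkpoint, eval_trainer.evaluate())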

Maybe someone can tell me more about this.