How can I continue to train my fine-tuned model with new datasets?

I fine-tuned the pre-trained gpt2-medium model with the following code and got good results in evaluation.

import os

from datasets import load_dataset
from tokenizers.processors import TemplateProcessing
from transformers import AutoModelWithLMHead, AutoTokenizer, Trainer, TrainingArguments

path = '/data'

license_numbers = ['00001-1']

for license_number in license_numbers:
    data = load_dataset('json', data_files={
        'train': os.path.join(path, license_number + '_train_company.json'),
        'test': os.path.join(path, license_number + '_test_company.json')
    })

    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    # GPT-2 has no pad token, so reuse the EOS token for padding
    tokenizer.pad_token = tokenizer.eos_token

    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=tokenizer.bos_token + " $A " + tokenizer.eos_token,
        special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id), (tokenizer.bos_token, tokenizer.bos_token_id)]
    )

    tokenized_data = data.map(
        lambda x: tokenizer(x['text']),
        batched=True,
        num_proc=4,
        remove_columns=data["train"].column_names,
    )

    block_size = 1000

    # group_texts: my helper (defined elsewhere) that concatenates the tokenized
    # texts and splits them into chunks of block_size tokens
    lm_dataset = tokenized_data.map(group_texts, batched=True, num_proc=4)

    model = AutoModelWithLMHead.from_pretrained("gpt2-medium")
    model.resize_token_embeddings(len(tokenizer))

    output_dir = f"./models/gpt2_medium_00000_1"

    training_args = TrainingArguments(
        output_dir=output_dir, 
        overwrite_output_dir=True,
        num_train_epochs=20,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        eval_steps=50, 
        save_steps=1600, 
        warmup_steps=10,
        prediction_loss_only=True,
    )

    # MyDataCollator: my custom collator (defined elsewhere); mlm=False gives causal-LM labels
    data_collator = MyDataCollator(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=lm_dataset['train'].shuffle(seed=42),
        eval_dataset=lm_dataset['test'].shuffle(seed=42),
    )

    trainer.train()
    trainer.save_model()
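
By “good results in evaluation” I mean the eval loss / perplexity the Trainer reports on the test split, roughly something like this (a sketch; the exact evaluation code is not part of the snippet above):

import math

# evaluate on the held-out test split and report loss / perplexity
eval_results = trainer.evaluate()
print(f"eval loss: {eval_results['eval_loss']:.4f}")
print(f"perplexity: {math.exp(eval_results['eval_loss']):.2f}")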

I then wanted to continue training the resulting model on new data, to see how much of the previously learned behaviour is lost, i.e. “catastrophic forgetting”. I tried this with the following code (the elided parts are the same as in the first script):

...
license_numbers = ['00001-2']

for license_number in license_numbers:
    data = load_dataset('json', data_files={
        'train': os.path.join(path, license_number + '_train_company.json'),
        'test': os.path.join(path, license_number + '_test_company.json')
    })

    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    ...
    model = AutoModelWithLMHead.from_pretrained("./models/gpt2_medium_00000_1")
    ...
    output_dir = f"./models/gpt2_medium_00000_2"

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        ...
    )
    ...
    trainer.train()
    trainer.save_model()

Now that I am evaluating the new gpt2_medium_00000_2 model, I notice that the results are roughly 90% worse. I don’t think an LLM forgets that much from a single additional fine-tuning run, so I suspect there is a flaw in my implementation.
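
Concretely, the comparison I have in mind is to evaluate both checkpoints on the old (00001-1) test split, roughly like this (a sketch; lm_dataset_old stands for the grouped 00001-1 test data from the first run):

# compare the old and the new checkpoint on the *old* test split to measure forgetting
for checkpoint in ["./models/gpt2_medium_00000_1", "./models/gpt2_medium_00000_2"]:
    model = AutoModelWithLMHead.from_pretrained(checkpoint)
    eval_trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        eval_dataset=lm_dataset_old["test"],
    )
    print(checkpoint, eval_trainer.evaluate())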

Maybe someone can tell me more about this.