How to continue training on another dataset?

beyond · April 13, 2022, 7:33am

Hi, I want to do some language model pre-training, using the Trainer API.

Assume I have two datasets, wikitext and bookcorpus. I want to train on wikitext first, saving checkpoints along the way, and then continue training the same model on bookcorpus while saving further checkpoints.
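To make the intended flow concrete, here is a minimal sketch of training in phases, one dataset after another, with the same model object carried over. It is not tied to any library: `Phase`, `train_in_phases`, `make_trainer`, and the `phaseN-name/final` directory names are all hypothetical names made up for illustration. With the Trainer API, `make_trainer` would presumably build a `Trainer` from the model, a `TrainingArguments(output_dir=...)`, and the phase's `train_dataset`; the sketch only relies on the resulting object having `train()` and `save_model(output_dir)`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, List


@dataclass
class Phase:
    """One training phase: a label and the dataset to train on (hypothetical helper)."""
    name: str
    dataset: Any


def train_in_phases(
    model: Any,
    phases: List[Phase],
    output_root: str,
    make_trainer: Callable[[Any, Any, str], Any],
) -> List[str]:
    """Train `model` on each phase's dataset in order, never mixing datasets.

    `make_trainer(model, dataset, output_dir)` is an assumed factory that
    returns an object exposing `train()` and `save_model(output_dir)`.
    Returns the directories where the end-of-phase models were saved.
    """
    saved_dirs = []
    for index, phase in enumerate(phases, start=1):
        # Separate output directory per phase, so checkpoints stay attributable.
        phase_dir = (Path(output_root) / f"phase{index}-{phase.name}").as_posix()
        trainer = make_trainer(model, phase.dataset, phase_dir)
        trainer.train()
        # Explicit snapshot at the end of the phase ("finished training on ...").
        final_dir = (Path(phase_dir) / "final").as_posix()
        trainer.save_model(final_dir)
        saved_dirs.append(final_dir)
    return saved_dirs
```

Because the same `model` object is passed to every phase, the second phase starts from the weights the first phase ended with, which is the "continue training" part.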

I would like the checkpoints to look something like this:

checkpoint-500 (only wikitext)
checkpoint-1000 (only wikitext)
checkpoint-1500 (only wikitext)
checkpoint-1800 (finished training on wikitext)
checkpoint-2300 (continue training on bookcorpus)
...
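The numbering in the list above can be written down as plain arithmetic. This is only a sketch of the naming I have in mind, not a claim about how the Trainer numbers checkpoints: it assumes a checkpoint every 500 steps within each phase, one extra checkpoint when a phase ends, and step numbers that keep counting across phases. `checkpoint_plan` is a hypothetical helper, and the 1200 steps for bookcorpus are a made-up figure.

```python
from typing import List, Tuple


def checkpoint_plan(
    phase_steps: List[Tuple[str, int]], save_steps: int
) -> List[Tuple[int, str]]:
    """Return (global_step, dataset_name) for every checkpoint in the desired layout.

    Assumptions (illustrative only): saves happen every `save_steps` steps
    counted from the start of each phase, plus once at the end of the phase.
    """
    plan = []
    offset = 0  # steps completed in earlier phases
    for name, steps in phase_steps:
        local_step = save_steps
        while local_step < steps:
            plan.append((offset + local_step, name))
            local_step += save_steps
        plan.append((offset + steps, name))  # end-of-phase checkpoint
        offset += steps
    return plan


# 1800 wikitext steps as in the list above; 1200 bookcorpus steps is hypothetical.
for step, name in checkpoint_plan([("wikitext", 1800), ("bookcorpus", 1200)], 500):
    print(f"checkpoint-{step} ({name})")
```

With these numbers the wikitext phase ends at checkpoint-1800 and the first bookcorpus checkpoint is checkpoint-2300, matching the list.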

I don’t want to mix the two datasets, because I want to analyse how the model changes after it is trained on the second dataset. How can I achieve this?

Could anyone help me?

beyond · April 14, 2022, 3:12pm

@sgugger Could you please have a look?
