I used to use the checkpoint callback in Keras. Is there an alternative in Hugging Face?
If I re-run the training cell, it continues from the last loss, so does that mean it is saved automatically?
Could anyone explain how Hugging Face saves partial checkpoints, so that I can continue training later from that point?
Yes, you can control how checkpoints are handled through the Trainer
class. Have a read through the documentation, which should help you.
Thanks, BramVanroy
I think this documentation is for PyTorch, and I am currently using TensorFlow.
So does that mean there is no such solution in TensorFlow yet?
I am not familiar with the TF code base of the library, but it seems that some checkpointing is implemented:
Maybe someone else can chime in, who knows more about TF.
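One hedged workaround, assuming you are training a Hugging Face TF model with Keras's `fit()`: the TF model classes subclass `tf.keras.Model`, so the familiar Keras `ModelCheckpoint` callback should still apply. The tiny model and dummy data below are illustrative stand-ins, not anything from this thread:

```python
# Sketch: reusing the Keras ModelCheckpoint callback.
# The toy model stands in for a HF TF model such as one returned by
# TFAutoModelForSequenceClassification.from_pretrained(...).
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
model.compile(optimizer="adam", loss="mse")

# Saves weights after every epoch; the {epoch} pattern gives one file per epoch.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="tf_ckpt/weights.{epoch:02d}.weights.h5",
    save_weights_only=True,
)

x = np.zeros((4, 3), dtype="float32")
y = np.zeros((4, 2), dtype="float32")
model.fit(x, y, epochs=2, callbacks=[ckpt_cb], verbose=0)

# To continue later: rebuild the model, then restore the saved weights.
model.load_weights("tf_ckpt/weights.02.weights.h5")
```

Note that this only restores model weights, not the optimizer state, so it is a lighter-weight resume than the PyTorch Trainer's checkpointing.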