finetune_rag.py won't save checkpoints

Hello, I’m a complete beginner at training RAG models. I’ve been trying to train a custom RAG model on a custom knowledge dataset, following the instructions in the HuggingFace documentation and the RAG repository. So far I have converted my custom knowledge dataset into a .csv file as instructed and prepared the 6 files (train.source, train.target, val.source, val.target, test.source, test.target) needed for finetuning.

I ran the command from the finetune_rag.sh file provided in the repository with the following parameters (pretty much the default settings in finetune_rag.sh, except that I supplied my own training data and output directory):

python3 ./finetune_rag.py \
    --data_dir mydata \
    --output_dir my_model_ray2 \
    --model_name_or_path facebook/rag-token-base \
    --model_type rag_token \
    --default_root_dir my_save \
    --accelerator gpu \
    --gpus 1 \
    --index_name custom \
    --passages_path my_train/my_knowledge_dataset \
    --index_path my_train/my_knowledge_dataset_hnsw_index.faiss \
    --profile \
    --do_train \
    --do_predict \
    --fp16 \
    --n_val -1 \
    --train_batch_size 2 \
    --eval_batch_size 1 \
    --max_source_length 128 \
    --max_target_length 25 \
    --val_max_target_length 25 \
    --test_max_target_length 25 \
    --label_smoothing 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --weight_decay 0.001 \
    --adam_epsilon 1e-08 \
    --max_grad_norm 0.1 \
    --lr_scheduler polynomial \
    --learning_rate 3e-05 \
    --num_train_epochs 100 \
    --warmup_steps 500 \
    --gradient_accumulation_steps 1 \
    --distributed_retriever ray \
    --num_retrieval_workers 4 \
    --checkpoint_callback

At the end of the training, I get the following error:

finetune_rag.py 649 <module>
main(args)

finetune_rag.py 629 main
trainer.test()

trainer.py 911 test
return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)

trainer.py 685 _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)

trainer.py 949 _test_impl
self.tested_ckpt_path = self.__set_ckpt_path(

trainer.py 1419 __set_ckpt_path
raise MisconfigurationException(

pytorch_lightning.utilities.exceptions.MisconfigurationException:
`.test(ckpt_path="best")` is set but `ModelCheckpoint` is not configured to save the best model.

I also tried --enable_checkpointing instead of --checkpoint_callback, but I got the same result.

Here are the relevant package versions:

torch==2.3.0
pytorch-lightning==1.5.10
accelerate==0.29.3
tokenizers==0.19.1
transformers==4.40.0

Addendum: I know the documentation says I should run the script with pytorch-lightning==1.3.1, but when I do, the training script won’t run at all because I get an ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data'. The training script only seems to run when I set pytorch-lightning to 1.5.10.

Addendum 2: I just tried passing ckpt_path=None to trainer.test() as suggested on StackOverflow, but I still get the same error message as above.
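
A possible explanation for why ckpt_path=None makes no difference: as far as I can tell, in pytorch-lightning 1.5.x a call to trainer.test() without a model treats ckpt_path=None as "best". The sketch below is a simplified paraphrase of that lookup, not the actual Lightning source. The function name resolve_ckpt_path and the local exception class are hypothetical stand-ins introduced only for illustration.

```python
from typing import Optional


class MisconfigurationException(Exception):
    """Local stand-in for the Lightning exception of the same name."""


def resolve_ckpt_path(
    ckpt_path: Optional[str],
    model_provided: bool,
    best_model_path: str,
) -> Optional[str]:
    """Simplified model of how .test() picks the weights to evaluate.

    best_model_path plays the role of the checkpoint callback's
    best_model_path attribute, which stays "" until a best
    checkpoint has actually been written.
    """
    # A model passed directly, e.g. trainer.test(model), is used as-is.
    if model_provided and ckpt_path is None:
        return None
    # No model and no path: fall back to the best checkpoint.
    if ckpt_path is None:
        ckpt_path = "best"
    if ckpt_path == "best":
        if not best_model_path:
            raise MisconfigurationException(
                '`.test(ckpt_path="best")` is set but `ModelCheckpoint` '
                "is not configured to save the best model."
            )
        return best_model_path
    # Any other value is treated as an explicit checkpoint path.
    return ckpt_path
```

Under this reading, the traceback above (trainer.test() called with no arguments) ends up on the "best" branch whether ckpt_path is omitted or set to None. Passing the trained model explicitly, or passing a concrete checkpoint path, would take a different branch; whether that is enough for finetune_rag.py is an assumption I have not verified.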

When I checked the code for finetune_rag.py, I saw that there is a method that saves checkpoints, but I can’t find anywhere in the training script where it is called.

    @pl.utilities.rank_zero_only
    def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
        save_path = self.output_dir.joinpath("checkpoint{}".format(self.step_count))
        self.model.config.save_step = self.step_count
        self.model.save_pretrained(save_path)
        self.tokenizer.save_pretrained(save_path)

Can anyone give me a pointer to what I’m doing wrong, or tell me how to configure ModelCheckpoint? Any guidance would be appreciated. Thanks in advance!
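
For reference, here is a minimal sketch of what an explicit ModelCheckpoint configuration could look like in pytorch-lightning 1.5.x. The helper names checkpoint_kwargs and build_checkpoint_callback are hypothetical, and the metric name "val_loss" is an assumption: it has to match whatever the module in finetune_rag.py actually logs, which I have not confirmed.

```python
from pathlib import Path
from typing import Any, Dict


def checkpoint_kwargs(
    output_dir: str, monitor: str = "val_loss", mode: str = "min"
) -> Dict[str, Any]:
    """Keyword arguments for ModelCheckpoint (the monitor name is an assumption)."""
    return {
        "dirpath": str(Path(output_dir)),
        "monitor": monitor,  # must match a metric logged by the LightningModule
        "mode": mode,        # "min" for losses, "max" for scores such as EM
        "save_top_k": 1,     # keep only the single best checkpoint
        "save_last": True,   # also keep the most recent checkpoint
    }


def build_checkpoint_callback(output_dir: str, **kwargs: Any):
    # Imported lazily so checkpoint_kwargs can be inspected
    # without pytorch-lightning installed.
    from pytorch_lightning.callbacks import ModelCheckpoint

    return ModelCheckpoint(**checkpoint_kwargs(output_dir, **kwargs))
```

The resulting callback would be passed to the trainer through Trainer(callbacks=[...]). After fit(), its best_model_path and last_model_path attributes show which checkpoints were written; an empty best_model_path would be consistent with the error above.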

Addendum 3: I ended up having to change some lines in callback_rag.py and finetune_rag.py to make things work. Instead of saving model_best, I made it save model_last and went from there.