Hello, I'm a complete beginner at training RAG models, and I've been trying to fine-tune a custom RAG model on a custom knowledge dataset, following the instructions in the HuggingFace documentation and the RAG repository. So far I have converted my custom knowledge dataset into a .csv file as instructed and prepared the six files (train.source, train.target, val.source, val.target, test.source, test.target) needed for fine-tuning.
I run the finetune_rag.sh script provided in the repository with the parameters below (essentially the default settings in finetune_rag.sh, except that I point it at my own training data and output directory):
python3 ./finetune_rag.py \
--data_dir mydata \
--output_dir my_model_ray2 \
--model_name_or_path facebook/rag-token-base \
--model_type rag_token \
--default_root_dir my_save \
--accelerator gpu \
--gpus 1 \
--index_name custom \
--passages_path my_train/my_knowledge_dataset \
--index_path my_train/my_knowledge_dataset_hnsw_index.faiss \
--profile \
--do_train \
--do_predict \
--fp16 \
--n_val -1 \
--train_batch_size 2 \
--eval_batch_size 1 \
--max_source_length 128 \
--max_target_length 25 \
--val_max_target_length 25 \
--test_max_target_length 25 \
--label_smoothing 0.1 \
--dropout 0.1 \
--attention_dropout 0.1 \
--weight_decay 0.001 \
--adam_epsilon 1e-08 \
--max_grad_norm 0.1 \
--lr_scheduler polynomial \
--learning_rate 3e-05 \
--num_train_epochs 100 \
--warmup_steps 500 \
--gradient_accumulation_steps 1 \
--distributed_retriever ray \
--num_retrieval_workers 4 \
--checkpoint_callback
At the end of the training, I get the following error:
finetune_rag.py 649 <module>
main(args)
finetune_rag.py 629 main
trainer.test()
trainer.py 911 test
return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
trainer.py 685 _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
trainer.py 949 _test_impl
self.tested_ckpt_path = self.__set_ckpt_path(
trainer.py 1419 __set_ckpt_path
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException:
`.test(ckpt_path="best")` is set but `ModelCheckpoint` is not configured to save the best model.
I also tried --enable_checkpointing instead of --checkpoint_callback, but I get the same error.
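From what I understand of pytorch-lightning 1.5.x, trainer.test(ckpt_path="best") only works if a ModelCheckpoint callback with a monitor metric was attached to the Trainer during training, so that best_model_path gets populated. Here is a minimal sketch of what I believe is needed; the dirpath and the "val_loss" monitor name are my assumptions, and the real monitor key would have to match whatever metric finetune_rag.py actually logs:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# "val_loss" and the dirpath are placeholders; the monitor key must match
# a metric the LightningModule logs during validation.
checkpoint_callback = ModelCheckpoint(
    dirpath="my_model_ray2",
    filename="{epoch}-{val_loss:.3f}",
    monitor="val_loss",
    mode="min",
    save_top_k=1,  # keep only the best checkpoint so "best" is well-defined
)

trainer = pl.Trainer(gpus=1, max_epochs=100, callbacks=[checkpoint_callback])
# After trainer.fit(model), checkpoint_callback.best_model_path is populated
# and trainer.test(ckpt_path="best") can resolve it.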
Here are the relevant package versions:
torch==2.3.0
pytorch-lightning==1.5.10
accelerate==0.29.3
tokenizers==0.19.1
transformers==4.40.0
addendum: I know the documentation says I should run the script with pytorch-lightning==1.3.1, but when I do, the training script won't run at all because of an ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (presumably a version mismatch between pytorch-lightning 1.3.1 and the newer torchmetrics I have installed). The training script only runs when I pin pytorch-lightning to 1.5.10.
addendum2: I just tried passing ckpt_path=None to trainer.test() as suggested on StackOverflow, but I still get the same error message as above.
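Reading the pytorch-lightning 1.5 source, I think I see why this doesn't help: when the model was connected through an earlier fit() call and ckpt_path is None, the Trainer silently falls back to ckpt_path="best" internally. If I understand correctly, passing the model object explicitly skips checkpoint resolution and tests the in-memory weights (trainer and model here stand for the objects inside finetune_rag.py):

trainer.test(model)           # model passed explicitly: no checkpoint lookup
# versus
trainer.test(ckpt_path=None)  # after fit(), this still resolves to "best"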
When I checked the code of finetune_rag.py, I saw a function that saves checkpoints, but it does not appear to be called anywhere during training:
@pl.utilities.rank_zero_only
def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
    # Saves the HF model and tokenizer alongside the Lightning checkpoint,
    # on the rank-0 process only.
    save_path = self.output_dir.joinpath("checkpoint{}".format(self.step_count))
    self.model.config.save_step = self.step_count
    self.model.save_pretrained(save_path)
    self.tokenizer.save_pretrained(save_path)
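As far as I can tell, though, this isn't dead code: on_save_checkpoint is a LightningModule hook that the Trainer invokes whenever it writes a .ckpt file, which only happens if checkpointing is actually configured or a checkpoint is saved manually, e.g.:

# A manual save would also fire the hook above (sketch, PL 1.5.x):
trainer.save_checkpoint("manual.ckpt")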
Can anyone give me a pointer to what I'm doing wrong? Or how can I configure ModelCheckpoint? Any guidance would be appreciated. Thanks in advance!
addendum3: I ended up having to change some lines in callback_rag.py and finetune_rag.py to make things work… Instead of saving model_best I made it save model_last, and went from there.