Do you need to zero your gradients for BART?
(I haven’t used BART, but when training BERT I need to call `model.zero_grad()` before passing each batch of data to the model.)
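For reference, here’s a minimal sketch of the kind of loop I mean, assuming a PyTorch/transformers setup — the model checkpoint, `dataloader`, and hyperparameters are placeholders, not anything specific to your code:

```python
import torch
from transformers import BartForConditionalGeneration

# Placeholder setup -- swap in your own model, data, and hyperparameters.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for batch in dataloader:  # assumed to yield dicts of tensors
    model.zero_grad()  # clear gradients left over from the previous batch
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
    )
    outputs.loss.backward()  # accumulate fresh gradients for this batch
    optimizer.step()         # update the weights
```

Without that `zero_grad()` call, PyTorch accumulates gradients across batches, which can silently break training. (`optimizer.zero_grad()` works equally well here.)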
Does your data look similar to the data BART was originally trained on? If it’s totally different, then your model could get worse before it gets better. What are you hoping it will learn from your new data?