My name is Robin, and I’m currently doing an internship. I’d like to ask you a question regarding fine-tuning a BERT model.
I’m working on a multi-label classification task with 301 labels. I’m using a BERT model with Adapters and LoRA. My dataset is relatively large (~1.5M samples), but I reduced it to around 1.1M to balance the classes — approximately 5000 occurrences per label.
However, during fine-tuning, I notice that the same few classes always dominate the predictions, despite the dataset being balanced.
Do you have any advice on what might be causing this, or what I could try to fix it?
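For context, here is roughly what the setup looks like (a simplified sketch; the base checkpoint, rank, and target modules are placeholders, assuming Hugging Face `transformers` and `peft`):

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# BERT classification head for multi-label: one sigmoid/BCE output per label.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",                        # placeholder checkpoint
    num_labels=301,
    problem_type="multi_label_classification",  # makes transformers use BCEWithLogitsLoss
)

# Wrap the base model with LoRA adapters on the attention projections.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # adapter rank (placeholder value)
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projection module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```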
Since this fine-tuning is large-scale, I think it would be better to first set up a small training loop for testing and run a few short experiments.
For example: train without LoRA, or with the rank raised to 64, to see whether LoRA has a significant impact; borrow hyperparameters from successful examples of similar models and tasks by other people; and so on. One way to wire up such comparison runs is sketched below.
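A sketch, assuming a Hugging Face `Trainer`; the run tags, subset size, and hyperparameters are assumptions, not values from your setup:

```python
from transformers import Trainer, TrainingArguments

def quick_run(model, train_ds, eval_ds, tag, steps=500):
    """Short smoke-test run on a small data subset to compare configurations."""
    args = TrainingArguments(
        output_dir=f"runs/{tag}",
        max_steps=steps,                  # keep each run short and comparable
        per_device_train_batch_size=32,
        learning_rate=2e-5,
        logging_steps=50,
        report_to="none",
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer.evaluate()             # compare eval loss across configs

# Hypothetical usage, on ~1% of the data:
#   small_train = train_ds.shuffle(seed=0).select(range(10_000))
#   quick_run(model_no_lora, small_train, eval_ds, "no_lora")
#   quick_run(model_lora_r64, small_train, eval_ds, "lora_r64")
```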
Hello,
Thanks for the advice.
I had already tested on a smaller dataset with only 6 labels for multi-label classification and obtained more than decent results, still using LoRA, which saved about 5 minutes on a 50-minute training run.
Hello. I see. So the problem can probably be narrowed down to a few candidate causes: issues that only appear when there are many classes to learn, or the base model weights being too stubborn to shift…
I’m not very familiar with NLP itself, so I think I can only help with troubleshooting…
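On the troubleshooting side, one quick sanity check is to count how often each class is predicted positive on a validation set. A sketch, assuming `logits` is a tensor of shape (num_examples, 301) from a validation pass and you use a 0.5 decision threshold:

```python
import torch

# logits: (num_examples, 301) tensor from a validation pass (assumed).
probs = torch.sigmoid(logits)
pred_positive = (probs > 0.5).float()

# How often each of the 301 classes is predicted positive.
per_class_counts = pred_positive.sum(dim=0)
top = per_class_counts.topk(10)
print("most-predicted classes:", top.indices.tolist())
print("counts:", top.values.tolist())
# If a handful of classes absorb most of the positives, the bias shows up here.
```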
Would it make sense to try fine-tuning the model directly without using LoRA?
Yeah, I think so. Bugs aside, using LoRA (PEFT) can change the content and quality of learning, for better or worse. Especially when training a model from scratch, it is usually safer to do a first pass without LoRA.
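If you try that, the change is small: load the same classification head and simply skip the PEFT wrapping, so all weights train. A sketch (the learning-rate note is a general rule of thumb, not something from your setup):

```python
from transformers import AutoModelForSequenceClassification

# Same setup as before, but no get_peft_model() call: all weights train.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=301,
    problem_type="multi_label_classification",
)
# Full fine-tuning typically wants a smaller LR than LoRA, e.g. 2e-5 vs 1e-4.
```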
By the way, it occurred to me that bias due to overlapping classes might be at play. With few classes this rarely shows up, but with 301 labels it may well be one of the causes here.
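One way to check for that kind of overlap is to look at the label co-occurrence matrix of your training targets. A sketch, assuming `Y` is the (num_samples, 301) multi-hot label matrix:

```python
import numpy as np

# Y: multi-hot label matrix of shape (num_samples, 301) (assumed).
cooc = Y.T @ Y                        # cooc[i, j] = samples tagged with both i and j
counts = np.diag(cooc).astype(float)  # per-label sample counts (~5000 each here)

# Conditional co-occurrence: P(label j | label i), ignoring the diagonal.
cond = cooc / counts[:, None]
np.fill_diagonal(cond, 0.0)

i, j = np.unravel_index(np.argmax(cond), cond.shape)
print(f"labels {i} and {j} overlap most: P({j}|{i}) = {cond[i, j]:.2f}")
# Heavily overlapping label pairs can bias the classifier toward the
# more frequent or easier member of the pair.
```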