Help with BERT Adapter + LoRA for Multi-Label Classification (301 classes)

My name is Robin, and I’m currently doing an internship. I’d like to ask you a question regarding fine-tuning a BERT model.

I’m working on a multi-label classification task with 301 labels. I’m using a BERT model with Adapters and LoRA. My dataset is relatively large (~1.5M samples), but I reduced it to around 1.1M to balance the classes — approximately 5000 occurrences per label.
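For context, here is a minimal sketch of the kind of setup I mean (assuming the Hugging Face transformers and peft libraries; the model name and LoRA hyperparameters are illustrative, not my exact configuration):

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

# Multi-label head over 301 classes: this problem_type makes the model use
# BCEWithLogitsLoss, so labels are 301-dimensional multi-hot float vectors.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=301,
    problem_type="multi_label_classification",
)

# Illustrative LoRA configuration on BERT's attention projections.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```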

However, during fine-tuning, I notice that the same few classes always dominate the predictions, despite the dataset being balanced.
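As a rough way to quantify this, something like the following can count per-class positive predictions on a validation set (a sketch only; it assumes sigmoid outputs thresholded at 0.5 and batches already moved to the model's device):

```python
import torch

@torch.no_grad()
def predicted_label_counts(model, dataloader, num_labels=301, threshold=0.5):
    """Count how often each label is predicted positive on a held-out set;
    a handful of very large counts means those classes dominate."""
    model.eval()
    counts = torch.zeros(num_labels)
    for batch in dataloader:
        logits = model(**batch).logits
        counts += (torch.sigmoid(logits) > threshold).sum(dim=0).float().cpu()
    return counts

# counts = predicted_label_counts(model, val_loader)
# print(counts.topk(10))  # the classes that dominate the predictions
```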

Do you have any advice on what might be causing this, or what I could try to fix it?

Thank you in advance!
Robin

Since this fine-tuning is large-scale, I think it would be better to first prepare a small training loop for testing and run a few small trainings.

Test tuning without LoRA, or with the rank set to 64, to see whether LoRA has a significant impact. You could also test by borrowing hyperparameters from other people's successful examples on similar models/tasks, and so on.
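For example, a small comparison could look roughly like this (a sketch, assuming already-tokenized train/eval datasets with multi-hot float labels; the subsample size, step count, and hyperparameters are placeholders):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType

def fresh_model():
    # Load a fresh classifier for each test run so the comparisons stay independent.
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=301,
        problem_type="multi_label_classification",
    )

def quick_run(train_ds, eval_ds, use_lora=True, rank=64, n_samples=20_000):
    model = fresh_model()
    if use_lora:
        cfg = LoraConfig(
            task_type=TaskType.SEQ_CLS,
            r=rank,
            lora_alpha=2 * rank,
            lora_dropout=0.1,
            target_modules=["query", "value"],
        )
        model = get_peft_model(model, cfg)
    args = TrainingArguments(
        output_dir=f"test-lora-r{rank}" if use_lora else "test-full-ft",
        max_steps=500,                    # keep each test run short
        per_device_train_batch_size=32,
        logging_steps=50,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds.shuffle(seed=0).select(range(n_samples)),
        eval_dataset=eval_ds,
    )
    trainer.train()
    return trainer.evaluate()

# quick_run(train_ds, eval_ds, use_lora=True, rank=64)   # LoRA at rank 64
# quick_run(train_ds, eval_ds, use_lora=False)           # full fine-tuning baseline
```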

Hello,
Thanks for the advice.
I had already tested on another, smaller dataset with only 6 labels for multi-label classification and obtained more than decent results, still using LoRA, which saved about 5 minutes on a 50-minute training run.

Hello. I see. So it seems the problem can be narrowed down to a few possibilities, such as issues that arise when there are many classes to learn, or the base model weights being stubborn… :thinking:

Do you have any advice to help guide my research?
Would it make sense to try fine-tuning the model directly without using LoRA?

Do you have any advice to help guide my research?

I’m not very familiar with NLP itself, so I think I can only help with troubleshooting… :sweat_smile:

Would it make sense to try fine-tuning the model directly without using LoRA?

Yeah, I think so. Bugs aside, using LoRA (PEFT) can change the content and quality of learning, for better or worse. Especially when pre-training a model from scratch, it is usually safer to do it without LoRA first.
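For the LoRA-free run, the only change is to skip the PEFT wrapper, so every BERT weight stays trainable. A minimal illustration:

```python
from transformers import AutoModelForSequenceClassification

# Full fine-tuning baseline: no PEFT wrapper, so all BERT weights are updated.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=301,
    problem_type="multi_label_classification",
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # roughly 110M for bert-base-uncased
```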

By the way, I suspect that bias from overlapping classes might be occurring. That is unlikely to be a problem when there are only a few classes, so it may be one of the causes here.

Alright, I’ll launch a new training run using only BERT to see how it goes. Thanks for the advice!
