Looking for a multilingual model to categorize news articles by their titles, one that can be fine-tuned with labeled data. It shouldn't rely on word-based methods. I have used SetFit, but the results aren't great. It would be helpful if anyone could point me in the right direction.
I'm looking to fine-tune the model on news titles.
Using a simple prompt with Gemma / LLaMA will do the job just fine.
With a (large) set of few-shot examples you will see amazing results.
You can use the free tier on Groq or Gemini.
And you won't need to fine-tune.
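A rough sketch of what that prompting setup could look like, assuming the `groq` Python client and a Llama model ID from their free tier (the labels, few-shot examples, and model name here are placeholders, so check Groq's docs for current IDs):

```python
# Sketch of few-shot classification by prompting, via Groq's free tier.
# Assumptions: `pip install groq`, GROQ_API_KEY set in the environment,
# and the model ID below still being offered -- verify against their docs.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Hypothetical label set and few-shot examples; swap in your own.
FEW_SHOT = """Classify the news title into one of: politics, sports, business, tech.

Title: "Parliament passes new budget bill"
Label: politics

Title: "Local team wins the championship final"
Label: sports
"""

def classify(title: str) -> str:
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model ID
        messages=[{"role": "user",
                   "content": f'{FEW_SHOT}\nTitle: "{title}"\nLabel:'}],
        temperature=0.0,
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()

print(classify("Central bank raises interest rates"))
```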
If this is an academic project and you have limited resources, you can fine-tune an encoder for that.
SetFit is a great technique to do so, specifically when the number of examples per label is small.
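For reference, a minimal SetFit run looks roughly like this (a sketch only: the checkpoint is one multilingual option among many, the tiny dataset is illustrative, and the Trainer API differs slightly across setfit versions):

```python
# Minimal SetFit sketch (setfit >= 1.0 style API; older versions use
# SetFitTrainer instead). Checkpoint and data are placeholders.
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Tiny illustrative dataset: a handful of examples per integer label.
train_ds = Dataset.from_dict({
    "text": ["Parliament passes budget", "Team wins the final",
             "Shares jump after earnings", "New phone unveiled"],
    "label": [0, 1, 2, 3],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()

print(model.predict(["Central bank raises rates"]))
```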
Flow for Training an Encoder:
- Set up your dataset as [title<string>] [label<int>] pairs.
- Divide the dataset into train-val-test.
- Fine-tune a multilingual BERT/BART/RoBERTa/… (a minimal sketch follows after this list).
- Go to the beach.
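A minimal sketch of that fine-tuning flow with Hugging Face transformers, assuming a CSV with title and integer label columns (the file name, checkpoint, and hyperparameters are placeholders):

```python
# Fine-tune a multilingual encoder on news titles.
# Assumes news_titles.csv with columns: title (string), label (int 0..K-1).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("csv", data_files="news_titles.csv")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)  # train/val split

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["title"], truncation=True, max_length=64)

dataset = dataset.map(tokenize, batched=True)

num_labels = len(set(dataset["train"]["label"]))
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels)

args = TrainingArguments(
    output_dir="news-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    eval_strategy="epoch",  # older transformers versions: evaluation_strategy
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables the default padding collator
)
trainer.train()
```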
Good luck.
Sahar
Thanks for the guidance.
Basically, I'm trying to train a model and set it up locally to process 1K+ news articles on a daily basis. SetFit gave good predictions at the start, but when new scenarios arise the accuracy falls. As for BERT, I'm training one now. Hope the results are good.
Please be advised that a model like SetFit can easily overfit to your training data (I mention this since you've indicated you are getting good initial predictions during training). It's a bit tricky to start with SetFit and contrastive learning, as you might need to be careful with your parameters.
Additionally, if you have time, consider using SetFit to create a few hundred labeled samples. Validate them manually and then use them to train another vanilla classification model (e.g., RoBERTa).
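That pseudo-labeling step could look roughly like this (a sketch assuming a SetFit model you've already trained and saved locally, and integer labels; the model path and file names are placeholders):

```python
# Use a trained SetFit model to pseudo-label unlabeled titles, then
# export them for manual validation before training a vanilla classifier.
import csv

from setfit import SetFitModel

# Placeholder path: a SetFit model saved via model.save_pretrained(...).
setfit_model = SetFitModel.from_pretrained("path/to/your-setfit-model")

# Replace with your own few hundred unlabeled titles.
unlabeled_titles = ["Central bank raises rates", "Team wins the final"]
preds = setfit_model.predict(unlabeled_titles)

with open("pseudo_labeled.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "label"])
    for title, label in zip(unlabeled_titles, preds):
        writer.writerow([title, int(label)])  # assumes integer labels
```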