Hi,
I am currently fine-tuning distilbert-base-multilingual-cased on a dataset that is imbalanced by nature.
To clarify: I train it on a list of phishing domains (not URLs) vs. "all domains" (from which I subtracted the phishing domains) for a text-classification task (via pipeline).
I have a large number of phishing domains (about 500k), and I have access to a full domain list of about 270 million (!) domains.
My question is: what is the best practice for training?
My thoughts:
(1) I could use the full phishing domain list and draw a random sample from the non-phishing domain list to get a 50:50 balance. My problem: reducing 270 million domains to about 500k is not trivial; I could use a random shuffle (see the sketch below).
(2) I could modify #1 in the same way to get a balance of 1:2, 1:3, …
(3) I could train on both full lists, so the balance would be about 1:540.
Due to the nature of the phishing domain list I cannot "extend" it, not even with historical data.
But I do not know how to reduce the full domain list (the non-phishing part) so that the sample is still representative enough for the language model.
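For (1) and (2), this is roughly what I had in mind for the random sampling: a minimal reservoir-sampling sketch that draws a uniform subset in a single pass without loading all 270 million domains into memory (the file name all_domains.txt is just a placeholder for my full list):

```python
import random

def reservoir_sample(path, k, seed=42):
    """Uniformly sample k non-empty lines from a large file: one pass, constant memory."""
    rng = random.Random(seed)
    reservoir = []
    n = 0  # number of domains seen so far
    with open(path, encoding="utf-8") as f:
        for line in f:
            domain = line.strip()
            if not domain:
                continue
            n += 1
            if len(reservoir) < k:
                reservoir.append(domain)
            else:
                j = rng.randrange(n)  # replace an existing entry with probability k/n
                if j < k:
                    reservoir[j] = domain
    return reservoir

# 1:1 against ~500k phishing domains; increase k for the 1:2, 1:3, ... variants
negatives = reservoir_sample("all_domains.txt", k=500_000)
```

Since each seen domain ends up in the reservoir with the same probability k/n, the result should be a uniform sample over the full list.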
Any ideas?
Thank you so much!