Hi,
I am currently fine-tuning distilbert-base-multilingual-cased on a dataset that is imbalanced by nature.
To clarify: I train it on a list of phishing domains (not URLs) vs. "all domains" (from which I subtracted the phishing domains) for a text-classification task (via pipeline).
I have a large number of phishing domains (about 500k), and I have access to a full domain list of about 270 million (!) domains.
My question is: what is the best practice for training?
My thoughts:
(1) I could use the full phishing domain list and draw a random sample from the non-phishing domain list to get a 50:50 balance. My problem: reducing 270 million domains to about 500k is not trivial; I could use a random shuffle (see the sketch below).
(2) I could modify #1 in the same way to get a balance of 1:2, 1:3, …
(3) I could train on both full lists, so the balance would be about 1:540.
Due to the nature of the phishing domain list I cannot "extend" it, not even with historical data.
But I do not know how to reduce the full domain list (the non-phishing part) so that the sample is still representative enough for the language model.
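For (1) and (2), this is roughly what I had in mind for the random sampling: a minimal reservoir-sampling sketch that draws a uniform subset in a single pass without loading all 270 million domains into memory (the file name all_domains.txt is just a placeholder for my full list):

```python
import random

def reservoir_sample(path, k, seed=42):
    """Uniformly sample k non-empty lines from a large file: one pass, constant memory."""
    rng = random.Random(seed)
    reservoir = []
    n = 0  # number of domains seen so far
    with open(path, encoding="utf-8") as f:
        for line in f:
            domain = line.strip()
            if not domain:
                continue
            n += 1
            if len(reservoir) < k:
                reservoir.append(domain)
            else:
                j = rng.randrange(n)  # replace an existing entry with probability k/n
                if j < k:
                    reservoir[j] = domain
    return reservoir

# 1:1 against ~500k phishing domains; increase k for the 1:2, 1:3, ... variants
negatives = reservoir_sample("all_domains.txt", k=500_000)
```

Since each seen domain ends up in the reservoir with the same probability k/n, the result should be a uniform sample over the full list.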
Any ideas?
Thank you so much!