Hi,
I’m trying to make French email sentences multilabel classification with certain categories such as commitmemt, proposition, meeting, request, subjectif, etc. .
The first problem I faced is that I don’t have labeled sentences rather I have french emails as dataset). Based on this I found the BC3 dataset (English emails) which has sentences annotated with some of labels listed above. So I came up with this approah; First fine tune a bert multilingual on this BC3 dataset on multilabel classification task and then make a zero-shot transfert learning with the finetuned model (or simply use it in inference) on sentences of my French emails. What do you think about this approach?
So I started by prepropcessing the BC3 dataset and obtain 848 sentences, each of them with their occurences annotations according to each categorty. On the image below, the last 5 columns represent the number of time each annotator labeled a sentence for a specific label.
Are those 848 samples enough to fine tune a Bert multilingual model?
I try to fine tune by representing category as on the image below .
With one epoch, BATCH_SIZE = 4, the loss function did’t converge, rather it oscillates between 0.79 and 0.34.
What kind of advices would you give the solve this kind of problem?
Thanks.