I want to do an NER task on news articles that are in dozens of languages. Is the best option to go for xlm-roberta-large-finetuned-conll03-english? I read that XLM models fine-tuned for one language work well in other languages as well. My main issue is that this model is too big. Should I go for smaller, language-specific models if I already know which language I'm dealing with?
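For context, this is roughly how I'm planning to use it (a sketch assuming the standard transformers token-classification pipeline; the German sentence is just a made-up example):

```python
from transformers import pipeline

# English-fine-tuned XLM-R checkpoint, applied zero-shot to a non-English sentence.
ner = pipeline(
    "token-classification",
    model="xlm-roberta-large-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

print(ner("Angela Merkel besuchte gestern Paris."))
```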
Also, I'm curious: why does xlm-roberta-large-finetuned-conll03-german have so many more downloads than the English one?
Hi @goutham794,
you could train a multi-lingual NER model on the WikiANN dataset (or better: use the train/dev/test partitions from the afshinrahimi/mmner repository on GitHub, "Massively Multilingual Transfer for NER").
But fine-tuning one big multi-lingual NER model can be tricky (fine-tuning instabilities), and you should keep in mind that WikiANN only has three label types (PER, ORG, LOC).
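If you do go the multi-lingual route, here is a minimal sketch of what fine-tuning on one WikiANN language config could look like (assuming the datasets/transformers Trainer API; the base model and hyperparameters are just placeholders):

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# One language config; for a multi-lingual model you would concatenate several.
dataset = load_dataset("wikiann", "de")
labels = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, I-PER, ...

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

def tokenize_and_align(examples):
    # Tokenize pre-split words and align word-level tags to sub-word pieces.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                label_ids.append(-100)  # ignore special tokens and trailing sub-words
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

encoded = dataset.map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("wikiann-ner", learning_rate=2e-5, num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```

For a truly multi-lingual model you would concatenate several language configs before training, which is also where the instabilities mentioned above tend to show up.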
If you already know which languages you want to cover, then a better approach would be to train “mono-lingual” models and just search for NER datasets for your desired languages. A good resource is: