XLM-RoBERTa Multilingual NER with OntoNotes 5 dataset


I would like to fine-tune XLM-RoBERTa for multilingual NER on the OntoNotes 5 dataset, but I really can't work out how to do that. Honestly, I read the paper and I know the theory behind this process, but I can't understand how to do that with the transformers library! I did not find any relevant example for it!

For now, I have my ontonotes5 data in the following form:

('لكن', 'O'),
('وزارة الداخلية الباكستانية', 'ORG'),
('وزارة', 'O'),
('الداخلية', 'O'),
('الباكستانية', 'O'),
('قالت', 'O'),
('ان', 'O'),
('11', 'CARDINAL'),
('11', 'O'),
('شخصا', 'O'),
('ً قتلوا', 'O'),

and this model: xlm-roberta-large (on the Hugging Face Hub)

Hi @Constantin, there’s a detailed tutorial here on using transformers for NER: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=545PP3o8IrJV

I’ve been able to use it with XLM-R without problems. In your case, the main work will be loading your dataset into a datasets.Dataset object (recommended for fast processing!). For that see the docs here or look at how one of the NER datasets is implemented to understand how the features need to be defined, e.g. GermanNER
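To make the loading step more concrete, here is a rough sketch of how (token, tag) pairs like yours could be turned into a datasets.Dataset and aligned with XLM-R subword tokens, following the same idea as the notebook. It assumes things your post does not show: that the pairs are already grouped into sentences, and that every entry is a single token with its own tag (span-level entries such as the full ORG phrase would need to be resolved first). The helper names build_label_map, to_columns, align_labels, to_dataset and tokenize_and_align are made up for illustration, and the sample sentence is only a toy.

```python
from typing import Dict, List, Optional, Tuple

# Toy sample in the (token, tag) form from the question. Grouping the pairs
# into sentences is an assumption; the flat list has no sentence boundaries.
sentences: List[List[Tuple[str, str]]] = [
    [("قالت", "O"), ("ان", "O"), ("11", "CARDINAL"), ("شخصا", "O")],
]


def build_label_map(sents: List[List[Tuple[str, str]]]) -> Dict[str, int]:
    """Map every tag string that occurs in the data to an integer id."""
    labels = sorted({tag for sent in sents for _, tag in sent})
    return {label: i for i, label in enumerate(labels)}


def to_columns(sents, label2id: Dict[str, int]) -> Dict[str, list]:
    """Convert sentences into the column layout used by NER datasets."""
    return {
        "tokens": [[tok for tok, _ in sent] for sent in sents],
        "ner_tags": [[label2id[tag] for _, tag in sent] for sent in sents],
    }


def align_labels(word_ids: List[Optional[int]], word_labels: List[int],
                 label_all_tokens: bool = False) -> List[int]:
    """Spread word-level labels over subword tokens.

    word_ids is what a fast tokenizer returns from BatchEncoding.word_ids():
    None for special tokens, otherwise the index of the source word.
    Positions labelled -100 are ignored by the loss.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)
        elif word_id != previous:
            aligned.append(word_labels[word_id])
        else:
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned


def to_dataset(sents, label2id: Dict[str, int]):
    """Build a datasets.Dataset with explicit features (needs `datasets`)."""
    from datasets import ClassLabel, Dataset, Features, Sequence, Value

    names = [lab for lab, _ in sorted(label2id.items(), key=lambda kv: kv[1])]
    features = Features({
        "tokens": Sequence(Value("string")),
        "ner_tags": Sequence(ClassLabel(names=names)),
    })
    return Dataset.from_dict(to_columns(sents, label2id), features=features)


def tokenize_and_align(batch, tokenizer):
    """Batched map function: tokenize pre-split words and attach labels."""
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = [
        align_labels(enc.word_ids(batch_index=i), tags)
        for i, tags in enumerate(batch["ner_tags"])
    ]
    return enc


label2id = build_label_map(sentences)

# Usage (downloads the model's tokenizer, so it is left commented out):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# dataset = to_dataset(sentences, label2id)
# tokenized = dataset.map(lambda b: tokenize_and_align(b, tokenizer), batched=True)
```

From there the tokenized dataset can be passed to a token classification model and Trainer as in the notebook, with the number of labels set to len(label2id).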
