haha yup I don’t know what the downstream quality on Amharic specifically is
So one thing that would be super useful is to pre-train a BERT-style model for Amharic on OSCAR data, and then try fine-tuning it on the am portion of the WikiAnn NER dataset
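Here's roughly how you'd grab both with the datasets library, assuming the Amharic config names I remember (unshuffled_deduplicated_am for OSCAR, am for WikiAnn — worth double-checking on the hub):
from datasets import load_dataset

# Amharic slice of OSCAR for the pre-training step
oscar_am = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")
# Amharic split of WikiAnn for the NER fine-tuning step
wikiann_am = load_dataset("wikiann", "am")

print(oscar_am[0]["text"])     # raw Amharic text
print(wikiann_am["train"][0])  # tokens + ner_tags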
You can find some instructions in the Language Modeling Tutorial (following the Masked Language Modeling approach)
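For the pre-training part, a rough, minimal sketch of the MLM setup from that tutorial might look like this — the hyperparameters are placeholders, and the ./amharic-tokenizer path assumes you've trained and saved a tokenizer there (see the tokenizer sketch further down):
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# placeholder path: a tokenizer you've trained yourself (sketch below)
tokenizer = BertTokenizerFast.from_pretrained("./amharic-tokenizer")

# a fresh BERT, sized to match the tokenizer's vocabulary
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

oscar_am = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")
dataset = oscar_am.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=oscar_am.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="amharic-bert", per_device_train_batch_size=16),
    # the collator handles the random masking for the MLM objective
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
    train_dataset=dataset,
)
trainer.train()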
If you want to train your own tokenizer, you can: it may help you achieve better results. However, the MarianMT models already ship a tokenizer that seems to handle Amharic pretty well (the checkpoint below was trained for Amharic-to-Swedish translation) - you can load it with:
from transformers import AutoTokenizer

# sentencepiece-based MarianTokenizer from the Amharic-to-Swedish checkpoint
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-am-sv", use_fast=True)
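And if you do go the train-your-own route, here's a minimal sketch with the tokenizers library — the vocab size, input file, and output directory are all placeholders:
import os

from tokenizers import BertWordPieceTokenizer

# train a WordPiece vocab on raw Amharic text (e.g. the OSCAR split above)
# dumped to a plain-text file, one document per line
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["oscar_am.txt"], vocab_size=30000, min_frequency=2)

# writes vocab.txt, which BertTokenizerFast.from_pretrained can load later
os.makedirs("amharic-tokenizer", exist_ok=True)
tokenizer.save_model("amharic-tokenizer")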
Let us know how that goes and if you run into any trouble training your model!
Also, you should definitely start an introduction thread for Amharic NLP in this forum! You can model it after the ones for Arabic and Korean if you need inspiration