haha yup I don’t know what the downstream quality on Amharic specifically is
So one thing that would be super useful is to pre-train a BERT-style model for Amharic on OSCAR data, and then try fine-tuning it on the am portion of the WikiAnn NER dataset
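Here's roughly how you'd grab both with the datasets library, assuming the Amharic config names I remember (unshuffled_deduplicated_am for OSCAR, am for WikiAnn — worth double-checking on the hub):
from datasets import load_dataset

# Amharic slice of OSCAR for the pre-training step
oscar_am = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")
# Amharic split of WikiAnn for the NER fine-tuning step
wikiann_am = load_dataset("wikiann", "am")

print(oscar_am[0]["text"])     # raw Amharic text
print(wikiann_am["train"][0])  # tokens + ner_tags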
You can find some instructions in the Language Modeling Tutorial (following the Masked Language Modeling approach)
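For the pre-training part, a rough, minimal sketch of the MLM setup from that tutorial might look like this — the hyperparameters are placeholders, and the ./amharic-tokenizer path assumes you've trained and saved a tokenizer there (see the tokenizer sketch further down):
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# placeholder path: a tokenizer you've trained yourself (sketch below)
tokenizer = BertTokenizerFast.from_pretrained("./amharic-tokenizer")

# a fresh BERT, sized to match the tokenizer's vocabulary
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

oscar_am = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")
dataset = oscar_am.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=oscar_am.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="amharic-bert", per_device_train_batch_size=16),
    # the collator handles the random masking for the MLM objective
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
    train_dataset=dataset,
)
trainer.train()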
If you want to train your own tokenizer, you can: it may help you achieve better results. However, the MarianMT models already ship a tokenizer that seems to handle Amharic pretty well (the checkpoint below was trained for Amharic-to-Swedish translation) - you can load it with:
from transformers import AutoTokenizer

# sentencepiece-based MarianTokenizer from the Amharic-to-Swedish checkpoint
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-am-sv", use_fast=True)
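And if you do go the train-your-own route, here's a minimal sketch with the tokenizers library — the vocab size, input file, and output directory are all placeholders:
import os

from tokenizers import BertWordPieceTokenizer

# train a WordPiece vocab on raw Amharic text (e.g. the OSCAR split above)
# dumped to a plain-text file, one document per line
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["oscar_am.txt"], vocab_size=30000, min_frequency=2)

# writes vocab.txt, which BertTokenizerFast.from_pretrained can load later
os.makedirs("amharic-tokenizer", exist_ok=True)
tokenizer.save_model("amharic-tokenizer")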
Let us know how that goes and if you run into any trouble training your model!
Also, you should definitely start an introduction thread for Amharic NLP in this forum! You can model it after the ones for Arabic and Korean if you need inspiration