Amharic is definitely an under-served language and one for which we’re working on improving coverage, but we do have some resources to get you started.
First, you can look at the list of datasets that have some Amharic text here.
For example, you can get about 28M words of Amharic text from Common Crawl via the OSCAR dataset.
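Something like this should do it (a sketch assuming the unshuffled_deduplicated_am config name for OSCAR's Amharic subset; it defines the oscar_am variable used in the next snippet):

from datasets import load_dataset

# Load the deduplicated Amharic portion of OSCAR (config name assumed)
oscar_am = load_dataset("oscar", "unshuffled_deduplicated_am")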
We don’t have any models trained exclusively on Amharic so far, but we do have a few translation models as well as a Language-Agnostic Sentence Encoder you can check out:
import torch
from transformers import BertModel, BertTokenizerFast

# LaBSE (Language-agnostic BERT Sentence Embedding) covers Amharic
tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
model = model.eval()

# Embed the first four Amharic texts from the OSCAR dataset
am_sentences = oscar_am["train"][:4]["text"]
am_inputs = tokenizer(am_sentences, truncation=True, return_tensors="pt", padding=True)
with torch.no_grad():
    am_outputs = model(**am_inputs)
This will give you vector representations of all of the texts in am_sentences.
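To pull out the vectors themselves, you can take the model's pooled output (pooler_output is the pooled output field of BertModel); the cosine-similarity step here is just one illustrative use:

# The pooled [CLS] output serves as the sentence-level embedding
am_embeddings = am_outputs.pooler_output  # shape: (4, 768)

# Illustrative use: cosine similarity between the first two texts
sim = torch.nn.functional.cosine_similarity(am_embeddings[0], am_embeddings[1], dim=0)
print(sim.item())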
Hope that helps get you started! Would you like to tell us more about what you want to do in Amharic so we can help further?
@yjernite thanks very much! I want to help fill in any missing gaps. You mention that “we don’t have any models trained exclusively on Amharic”, so maybe I can contribute there?
Also, when I hear “Language-Agnostic Sentence Encoder” I feel nervous.
Haha, yup, I don’t know what the downstream quality on Amharic specifically is.
So one thing you could do that would be super useful would be to pre-train a BERT-style model for Amharic on the OSCAR data, and maybe try fine-tuning it on the Amharic (am) portion of the WikiAnn NER dataset.
You can find some instructions in the Language Modeling Tutorial (following the Masked Language Modeling approach).
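To grab the NER data, something like this should work (a sketch assuming WikiAnn is available on the Hub under the wikiann name with an am config):

from datasets import load_dataset

# Load the Amharic split of WikiAnn (config name assumed)
wikiann_am = load_dataset("wikiann", "am")
print(wikiann_am["train"][0])  # tokens plus ner_tags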
If you want to train your own tokenizer, you can: it may help you achieve better results (there's a rough sketch after the snippet below). However, the MarianMT translation models already come with a tokenizer that seems to handle Amharic pretty well; you can load the one from the Amharic-to-Swedish model with:
from transformers import AutoTokenizer

# SentencePiece tokenizer from the Amharic-to-Swedish MarianMT model
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-am-sv", use_fast=True)
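And if you do want to train your own, here's a rough sketch with the tokenizers library (the file name oscar_am.txt and the vocab size are just placeholders):

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocab from scratch on raw Amharic text;
# "oscar_am.txt" is a hypothetical one-text-per-line dump of the OSCAR data
wp_tokenizer = BertWordPieceTokenizer()
wp_tokenizer.train(files=["oscar_am.txt"], vocab_size=30_000)
wp_tokenizer.save_model(".")  # writes a vocab.txt you can reuse with BertTokenizerFast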
Let us know how that goes and if you run into any trouble training your model!
Also, you should definitely start an introduction thread for Amharic NLP in this forum! You can model it after the ones for Arabic and Korean if you need inspiration.
File "C:\PYTHON\3.6.8\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 58: character maps to <undefined>
Oh I haven’t seen that one before! Could you open an issue in the datasets library? That way we can tag all the relevant people. Is that the full trace or is there more?
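For what it's worth, cp1252 is the Windows default codec and it leaves some bytes undefined (0x9d is one of them), so this usually means a file is being read without an explicit UTF-8 encoding. A quick illustration, using ሰላም as a sample string:

# Amharic text is UTF-8; decoding those bytes as cp1252 fails on
# the undefined bytes, which is exactly what the traceback shows
raw = "ሰላም".encode("utf-8")  # b'\xe1\x88\xb0\xe1\x88\x8b\xe1\x88\x9d'
raw.decode("utf-8")           # fine: 'ሰላም'
raw.decode("cp1252")          # UnicodeDecodeError: can't decode byte 0x9d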