Amharic NLP: Newbie, where do I start?

Greetings

I would like to contribute to Amharic (a low-resource language). I noticed that resources for this language are next to nil.

Where do I start and build up?
1st datasets?
2nd tokenizers?
3rd transformers?

Thanks!

Hello @yosiasz , welcome!

Amharic is definitely an under-served language and one for which we’re working on improving coverage, but we do have some resources to get you started.

First, you can look at the list of datasets that have some Amharic text here

For example, you can get about 28M words of Amharic text from the CommonCrawl in the OSCAR dataset with:

import datasets

oscar_am = datasets.load_dataset("oscar", "unshuffled_deduplicated_am")
print(oscar_am["train"][0])

We don’t have any models trained exclusively on Amharic so far, but we do have a few translation models as well as a Language-Agnostic Sentence Encoder you can check out:

import torch
from transformers import BertModel, BertTokenizerFast


tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
model = model.eval()

am_sentences = oscar_am["train"][:4]["text"]
am_inputs = tokenizer(am_sentences, truncation=True, return_tensors="pt", padding=True)

with torch.no_grad():
    am_outputs = model(**am_inputs)

This will give you vector representations of all of the texts in am_sentences.
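
A common next step is to compare those vectors with each other. The sketch below is an assumption about how you might use the output, not something stated in this thread: it takes a batch of sentence embeddings (for a BERT-style model, `am_outputs.pooler_output` is one candidate), L2-normalizes them, and computes pairwise cosine similarities. A random tensor stands in for the real model output so the snippet runs without downloading anything; the helper name `cosine_similarity_matrix` and the 768 dimension are only illustrative.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Return the pairwise cosine similarities between the rows of `embeddings`."""
    # L2-normalize each row so that a dot product equals the cosine similarity.
    normalized = F.normalize(embeddings, p=2, dim=1)
    return normalized @ normalized.T


# Stand-in for `am_outputs.pooler_output` (shape: batch_size x hidden_size).
torch.manual_seed(0)
fake_embeddings = torch.randn(4, 768)

similarities = cosine_similarity_matrix(fake_embeddings)
print(similarities.shape)
```

Each diagonal entry is a sentence compared with itself, so it should be close to 1; the off-diagonal entries are the ones worth inspecting.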

Hope that helps get you started! Would you like to tell us more about what you want to do in Amharic so we can help further?

@yjernite thanks very much! I want to fill in any gaps. You mentioned that “We don’t have any models trained exclusively on Amharic”. Could I contribute there?

Also, when I hear "Language-Agnostic Sentence Encoder" I feel nervous :slight_smile:

haha yup I don’t know what the downstream quality on Amharic specifically is :slight_smile:

So one thing you could do that would be super useful would be to pre-train a BERT-style model for Amharic on OSCAR data, and maybe try fine-tuning it on the am portion of the WikiAnn NER dataset

You can find some instructions in the Language Modeling Tutorial (following the Masked Language Modeling approach)
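
For intuition, masked language modeling hides a fraction of the input tokens and trains the model to predict them. The sketch below is a simplified, pure-Python illustration of the BERT-style masking recipe: select about 15% of the tokens; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged. It is not the tutorial's code. The function name `mask_tokens`, the toy token ids, and the `MASK_ID` and `VOCAB_SIZE` values are assumptions for illustration, and unlike a real data collator it does not skip special tokens or padding.

```python
import random

MASK_ID = 4          # hypothetical id of the [MASK] token
VOCAB_SIZE = 1000    # hypothetical vocabulary size
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_tokens(token_ids, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) following the BERT-style masking recipe."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), []
    for position, token_id in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            # Not selected: the loss skips this position.
            labels.append(IGNORE_INDEX)
            continue
        # Selected: the model must predict the original token here.
        labels.append(token_id)
        roll = rng.random()
        if roll < 0.8:
            inputs[position] = MASK_ID                    # 80%: mask token
        elif roll < 0.9:
            inputs[position] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # Remaining 10%: keep the original token.
    return inputs, labels


token_ids = list(range(10, 110))  # toy "sentence" of 100 token ids
masked_inputs, labels = mask_tokens(token_ids, rng=random.Random(0))
print(sum(label != IGNORE_INDEX for label in labels), "positions selected for prediction")
```

In practice you would not write this yourself: the library's data collator for language modeling applies this kind of masking on the fly for each batch.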

If you want to train your own tokenizer, you can: it may help you achieve better results. However, the MarianMT models already have a tokenizer that seems to work pretty well for English to Amharic translation - you can load it with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-am-sv", use_fast=True)
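
If you do decide to train your own tokenizer, a minimal sketch with the `tokenizers` library could look like the following. The three-sentence corpus, the `vocab_size` of 200, and the choice of WordPiece with a whitespace pre-tokenizer are illustrative assumptions; for a real model you would stream text from a large corpus such as OSCAR and pick a much larger vocabulary.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Tiny toy corpus, for illustration only.
corpus = [
    "ሰላም ዓለም",
    "አማርኛ የኢትዮጵያ ቋንቋ ነው",
    "ሰላም ኢትዮጵያ",
]

# WordPiece model with an explicit unknown token, split on whitespace first.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=200,  # illustrative; far too small for real use
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("ሰላም ኢትዮጵያ")
print(encoding.tokens)
```

Comparing how many pieces each word is split into by this tokenizer and by the MarianMT one is a quick way to judge whether training your own is worth the effort.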

Let us know how that goes and if you run into any trouble training your model!

Also, you should definitely start an introduction thread for Amharic NLP in this forum :wink: You can model it after the ones for Arabic and Korean if you need inspiration :hugs:

Since Amharic is a morphologically complex language like other Semitic languages, work on Arabic has been a great help in moving forward. I will keep an eye on it.

Shukran!

File "C:\PYTHON\3.6.8\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 58: character maps to <undefined>

when running

import datasets

oscar_am = datasets.load_dataset("oscar", "unshuffled_deduplicated_am")
print(oscar_am["train"][0])

How do I specify UTF-8 encoding?

Oh I haven’t seen that one before! Could you open an issue in the datasets library? That way we can tag all the relevant people. Is that the full trace or is there more?

Looks like a Windows issue.
Create an issue

Upgraded to Python 3.9.2 and it works just fine.
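
For context, the traceback above is what Python produces when UTF-8 bytes are decoded with the Windows default code page (cp1252): byte 0x9d has no mapping there, but it occurs in the UTF-8 encoding of common Ethiopic characters such as ም. The sketch below reproduces the failure and shows the usual fix for files you open yourself, which is to pass `encoding="utf-8"` explicitly. The file name and sample text are made up for illustration, and this does not show how the `datasets` library opens files internally.

```python
import os
import tempfile

sample = "ሰላም ዓለም"  # made-up sample text; ም is encoded in UTF-8 as E1 88 9D

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "amharic.txt")

    # Always write with an explicit encoding so the bytes on disk are UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Decoding the same bytes as cp1252 (the default on many Windows setups) fails.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError as error:
        cp1252_failed = True
        print("cp1252 failed:", error)

    # Reading with an explicit UTF-8 encoding round-trips the text.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == sample)
```

Python 3.7 and later also offer a UTF-8 mode (`python -X utf8` or the `PYTHONUTF8=1` environment variable) that makes UTF-8 the default for file I/O, which can help when the `open()` call lives inside a library you cannot edit.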

@yjernite

I had this problem while training the Amharic BERT you suggested in this discussion.

Hi @israel !

I’d suggest posting in the main forum so we can help you debug this.

There isn’t quite enough information in this screenshot, unfortunately: be sure to post the full notebook you’re running (e.g. as a Colab)!

This WikiAnn NER dataset does not have a validation file (val.txt). Do I need one to do this?

    #datasets = load_dataset("text", data_files={"train": 'data/amner/train'})

all good now :slight_smile:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"],
        #eval_dataset=lm_datasets["validation"],
        eval_dataset=None,
        data_collator=data_collator,
    )
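
If a corpus ships without a validation file, one option is to hold out part of the training text yourself instead of passing `eval_dataset=None`. The sketch below is an illustration under stated assumptions, not the method used in this thread: the helper `split_train_validation`, the 10% ratio, and the file names are hypothetical, and it assumes one training example per line (so it is not suitable as-is for token-per-line NER files).

```python
import os
import random
import tempfile


def split_train_validation(source_path, train_path, validation_path,
                           validation_fraction=0.1, seed=42):
    """Shuffle the lines of `source_path` and write a train/validation split."""
    with open(source_path, encoding="utf-8") as f:
        # Drop empty lines and make sure every kept line ends with a newline.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]

    random.Random(seed).shuffle(lines)
    n_validation = max(1, int(len(lines) * validation_fraction))

    with open(validation_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_validation])
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_validation:])
    return len(lines) - n_validation, n_validation


# Demo on a throwaway file with 20 made-up lines.
with tempfile.TemporaryDirectory() as tmp_dir:
    source = os.path.join(tmp_dir, "all.txt")
    with open(source, "w", encoding="utf-8") as f:
        f.writelines(f"sentence {i}\n" for i in range(20))

    n_train, n_validation = split_train_validation(
        source,
        os.path.join(tmp_dir, "train.txt"),
        os.path.join(tmp_dir, "validation.txt"),
    )

print(n_train, n_validation)
```

The two files can then be loaded with `load_dataset("text", data_files={"train": "train.txt", "validation": "validation.txt"})`, so that a `"validation"` split exists to pass as `eval_dataset`.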