Amharic NLP: Newbie where do I start

yosiasz · February 19, 2021, 1:10am

Greetings

I would like to contribute to Amharic (a low-resource language). I noticed that resources for this language are next to nil.

Where do I start and build up?
1st datasets?
2nd tokenizers?
3rd transformers?

https://github.com/huggingface/tokenizers/issues/203#issue-586736250

Thanks!

yjernite · February 19, 2021, 2:14pm

Hello @yosiasz, welcome!

Amharic is definitely an under-served language and one for which we’re working on improving coverage, but we do have some resources to get you started.

First, you can look at the list of datasets that have some Amharic text here

For example, you can get about 28M words of Amharic text from the CommonCrawl in the OSCAR dataset with:

import datasets

oscar_am = datasets.load_dataset("oscar", "unshuffled_deduplicated_am")
print(oscar_am["train"][0])

We don’t have any models trained exclusively on Amharic so far, but we do have a few translation models as well as a Language-Agnostic Sentence Encoder you can check out:

import torch
from transformers import BertModel, BertTokenizerFast

# Load the LaBSE tokenizer and model, and switch the model to inference mode
tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
model = model.eval()

# Take the first four texts from the OSCAR split loaded above
am_sentences = oscar_am["train"][:4]["text"]
am_inputs = tokenizer(am_sentences, truncation=True, return_tensors="pt", padding=True)

# Run the model without tracking gradients
with torch.no_grad():
    am_outputs = model(**am_inputs)

This will give you vector representations of all of the texts in am_sentences.
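One common thing to do with such vectors is to compare sentences by cosine similarity. Here is a minimal sketch in plain Python that runs on toy vectors. The helper names `cosine_similarity` and `similarity_matrix` are illustrative, not part of transformers. As an assumption about your setup, you could pass `am_outputs.pooler_output.tolist()` in place of the toy list.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities; entry [i][j] compares item i with item j."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy 3-dimensional stand-ins for sentence embeddings (real ones are much longer).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first vector
    [0.0, 3.0, 0.0],  # orthogonal to the first vector
]
scores = similarity_matrix(toy_embeddings)
```

Values close to 1 mean two vectors point the same way, and values close to 0 mean they are unrelated. Whether high scores actually track meaning for Amharic text is something you would need to check on your own data.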

Hope that helps get you started! Would you like to tell us more about what you want to do in Amharic so we can help further?

yosiasz · February 19, 2021, 5:12pm

@yjernite thanks very much! I want to fill in any missing gaps. You mention that “We don’t have any models trained exclusively on Amharic.” Could I contribute there?

Also, when I hear "Language-Agnostic Sentence Encoder" I feel nervous.

yjernite · February 19, 2021, 5:39pm

Haha, yup, I don’t know what the downstream quality on Amharic specifically is.

So one thing you could do that would be super useful would be to pre-train a BERT-style model for Amharic on OSCAR data, and maybe try fine-tuning it on the am portion of the WikiAnn NER dataset.

You can find some instructions in the Language Modeling Tutorial (following the Masked Language Modeling approach).
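To make the Masked Language Modeling idea concrete, here is a minimal sketch of BERT-style masking in plain Python. It selects about 15% of positions, and of those replaces 80% with the mask token, 10% with a random token, and leaves 10% unchanged. The function name `mask_tokens`, the token ids, `mask_id=4`, and `vocab_size=1000` are all made up for illustration. The sketch also ignores special tokens and padding, which a real data collator has to handle.

```python
import random
from typing import List, Tuple

IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch's default ignore_index)


def mask_tokens(
    token_ids: List[int],
    mask_id: int,
    vocab_size: int,
    rng: random.Random,
    mlm_probability: float = 0.15,
) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels) for one sequence using 80/10/10 masking."""
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, original in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected, so it is not scored
        labels[i] = original  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


rng = random.Random(0)
token_ids = list(range(10, 60))  # hypothetical ids for a 50-token sentence
inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=1000, rng=rng)
```

The model only gets a loss signal at positions where `labels` is not `IGNORE_INDEX`, so each pass over the data trains on a different random subset of tokens.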

If you want to train your own tokenizer, you can: it may help you achieve better results. However, the MarianMT models already have a tokenizer that seems to work pretty well for English to Amharic translation - you can load it with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-am-sv", use_fast=True)

Let us know how that goes and if you run into any trouble training your model!

Also, you should definitely start an introduction thread for Amharic NLP in this forum. You can model it after the ones for Arabic and Korean if you need inspiration.

yosiasz · February 19, 2021, 5:43pm

Since Amharic is a morphologically complex language like other Semitic languages, the Arabic work has been a great help in moving forward. I will keep my eyes on them.

Shukran!

yosiasz · February 19, 2021, 9:03pm

File "C:\PYTHON\3.6.8\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 58: character maps to <undefined>

when running

import datasets

oscar_am = datasets.load_dataset("oscar", "unshuffled_deduplicated_am")
print(oscar_am["train"][0])

How do I specify UTF-8 encoding?

yjernite · February 19, 2021, 9:56pm

Oh I haven’t seen that one before! Could you open an issue in the datasets library? That way we can tag all the relevant people. Is that the full trace or is there more?

yosiasz · February 19, 2021, 9:57pm

Looks like a Windows issue. I created an issue:

github.com/huggingface/datasets

UnicodeDecodeError: windows 10 machine

opened 10:13PM - 19 Feb 21 UTC

closed 10:40PM - 19 Feb 21 UTC

yosiasz

Windows 10, Python 3.6.8. Running the load_dataset snippet above produces the UnicodeDecodeError traceback shown above.

yosiasz · February 19, 2021, 10:40pm

Upgraded to Python 3.9.2 and it works just fine.
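For anyone who hits the same traceback: it is consistent with a UTF-8 file being read with the Windows default cp1252 codec. That is an assumption about the cause, not something confirmed in this thread. The sketch below reproduces the same kind of error with the standard library only and shows that passing `encoding="utf-8"` to `open()` avoids it. The file name `sample.txt` is made up for the example.

```python
import os
import tempfile

# "selam" in Ethiopic script, written with escapes; its UTF-8 bytes include 0x9d.
amharic_text = "\u1230\u120b\u121d"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Write the text to disk as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(amharic_text)

    # Reading it back as cp1252 fails, because byte 0x9d is undefined in that codec.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_error = None
    except UnicodeDecodeError as err:
        cp1252_error = err

    # Reading with an explicit UTF-8 encoding round-trips the text.
    with open(path, encoding="utf-8") as f:
        utf8_text = f.read()
```

This only helps for files you open yourself. When the failing `open()` call is inside a library, as it seems to be here, upgrading (as above) or reporting the issue is the more realistic fix.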

israel · February 23, 2021, 10:29pm

@yjernite

I had this problem while training the Amharic BERT you suggested in this discussion.

yjernite · February 23, 2021, 10:38pm

Hi @israel!

I’d suggest posting in the main forum so we can help you debug this.

There isn’t quite enough information in this screenshot, unfortunately: be sure to post the full notebook you’re running (e.g. as a Colab)!

israel · February 23, 2021, 10:42pm

yosiasz · February 27, 2021, 11:14pm

This WikiAnn NER dataset does not have a validation file (val.txt). Do I need one to do this?

    #datasets = load_dataset("text", data_files={"train": 'data/amner/train'})

yosiasz · February 27, 2021, 11:38pm

all good now

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"],
        #eval_dataset=lm_datasets["validation"],
        eval_dataset=None,
        data_collator=data_collator,
    )
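Training without an evaluation set means no held-out loss gets reported. If you would rather keep evaluation, one option is to carve a validation set out of the training data. Here is a minimal sketch in plain Python, assuming one example per line. The function name `split_train_validation` and the toy lines are made up for illustration. For token-per-line NER files you would need to split on sentence boundaries instead, which this sketch does not do.

```python
import random
from typing import List, Sequence, Tuple


def split_train_validation(
    lines: Sequence[str], validation_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle lines reproducibly and hold out a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical stand-in for the lines of a local training file.
lines = [f"sentence {i}" for i in range(100)]
train_lines, validation_lines = split_train_validation(lines)
```

If your data is already a `datasets.Dataset`, its `train_test_split` method does the same kind of split without custom code.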
