Amharic NLP - Train BERT-style model

@israel Here is a thread where we can collaborate on work to pre-train a BERT-style model for Amharic on OSCAR data.

One thing I have noticed about a lot of NLP efforts is that they have a high barrier to entry. I believe the documentation needs to be clear enough that anyone coming after us, even with minimal data science knowledge, can implement the process by following easy, step-by-step instructions.


Please DM me.


This part has some issues in my Jupyter notebook:

def tokenize_function(examples):
    return tokenizer(examples["text"])

when it is called from

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

File "", line 2, in tokenize_function
NameError: name 'tokenizer' is not defined

Are you talking about the Colab you shared in another thread? Did you run the cells in order? It looks like you haven’t run tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) yet.
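
For anyone hitting the same error later, here is a minimal sketch of why cell order matters. It is not the Colab's actual code: the whitespace-splitting tokenizer below is a hypothetical stand-in for AutoTokenizer.from_pretrained(model_checkpoint), used only so the snippet runs without downloading anything, and the sample batch is made up. The point is that tokenize_function only looks up the name tokenizer when it is called, so it fails until that name has been defined.

```python
# Same shape as the function in the thread: the global name `tokenizer`
# is looked up when the function is called, not when it is defined.
def tokenize_function(examples):
    return tokenizer(examples["text"])

# Made-up sample batch, in the dict-of-lists shape a batched map passes in.
batch = {"text": ["ሰላም ዓለም", "አዲስ አበባ"]}

# Calling the function before `tokenizer` exists raises a NameError,
# which is what happens when the notebook cells are run out of order.
try:
    tokenize_function(batch)
    error_message = None
except NameError as err:
    error_message = str(err)

# Hypothetical stand-in for AutoTokenizer.from_pretrained(model_checkpoint):
# splits each text on whitespace so the example needs no download.
def tokenizer(texts):
    return {"tokens": [text.split() for text in texts]}

# With the name defined, the same function now works unchanged.
result = tokenize_function(batch)
print(error_message)
print(result)
```

In the real notebook the equivalent fix is simply to run the cell that creates the tokenizer before the cell that calls map.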