@israel Here is a thread where we can collaborate on work to pre-train a BERT-style model for Amharic on OSCAR data.
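To get us started, here is a minimal sketch of pulling the Amharic portion of OSCAR with the Hugging Face `datasets` library. I believe the Amharic config is named `unshuffled_deduplicated_am`, but it is worth double-checking on the Hub:

```python
# Minimal sketch: load the Amharic subset of OSCAR via Hugging Face datasets.
# Assumes the config name "unshuffled_deduplicated_am"; verify on the Hub
# before relying on it.
from datasets import load_dataset

oscar_am = load_dataset("oscar", "unshuffled_deduplicated_am", split="train")

print(oscar_am)             # dataset summary (rows, columns)
print(oscar_am[0]["text"])  # first Amharic document
```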
One thing I have noticed across a lot of NLP efforts is that they have a high barrier to entry. I believe our documentation needs to be clear enough that anyone coming after us with minimal data science knowledge can reproduce the process from easy-to-follow, step-by-step instructions.
Are you talking about the Colab you shared in the other thread? Did you run the cells in order? It looks like you haven't run `tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)` yet.
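For reference, the cells need to run roughly in this order. Here is a minimal sketch; the checkpoint name below is a placeholder, not necessarily the one defined in the Colab:

```python
# Sketch of the expected cell order: define the checkpoint first, then load
# the tokenizer from it. "bert-base-multilingual-cased" is a placeholder;
# substitute whatever the Colab actually assigns to model_checkpoint.
from transformers import AutoTokenizer

model_checkpoint = "bert-base-multilingual-cased"  # placeholder value
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

print(tokenizer("ሰላም ለዓለም"))  # quick sanity check that the tokenizer loaded
```

If you run the tokenization cell before the cell that defines `model_checkpoint`, you will get a `NameError`, which may be what you are seeing.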