Africa has over 2,000 spoken languages, but they are massively underrepresented in NLP research and datasets. The goal of this project is to train strong models for the MasakhaNER corpus, a high-quality dataset for named entity recognition in 10 African languages.


There are a few popular multilingual models that you can start with:



It is unlikely that all ten languages in MasakhaNER are represented in multilingual models like XLM-R or mBERT, so some decisions will need to be made about which subsets of languages to focus on.

Desired project outcomes

  • Create a Streamlit or Gradio app on Hugging Face Spaces that can take text in one or more of the MasakhaNER languages and tag the person names (PER), organizations (ORG), locations (LOC), and dates and times (DATE) it contains.
  • Don’t forget to push all your models and datasets to the Hub so others can build on them!
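
For the demo app, the model's token-level predictions need to be merged into whole entities before they are shown to the user. Below is a minimal sketch of that step, assuming the tags follow a BIO scheme (B-PER, I-PER, O, and so on). The function name bio_to_spans and the example sentence are made up for illustration and are not taken from the corpus or from any library.

```python
from typing import List, Optional, Tuple

# Entity types named in the project description.
ENTITY_TYPES = {"PER", "ORG", "LOC", "DATE"}


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group token-level BIO tags into (entity_text, entity_type) pairs.

    Assumption: tags look like "B-PER", "I-PER" or "O". An I- tag that does
    not continue an open entity of the same type is ignored.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" and entity_type in ENTITY_TYPES:
            flush()  # a new entity starts, so close the previous one
            current_tokens, current_type = [token], entity_type
        elif prefix == "I" and current_tokens and entity_type == current_type:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()  # "O" or an inconsistent tag ends the open entity
    flush()
    return spans


# Made-up example (English, for readability only).
example_tokens = ["Amina", "Bello", "visited", "Lagos", "on", "Monday"]
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
example_spans = bio_to_spans(example_tokens, example_tags)
print(example_spans)
```

In a Streamlit or Gradio app, the returned pairs could be rendered as a table or as highlighted text. Dropping stray I- tags is one possible design choice; treating them as the start of a new entity is an equally reasonable alternative.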

Additional resources

