Please read the topic category description to understand what this is all about
Description
Africa has over 2,000 spoken languages, but these languages are massively underrepresented in NLP research and datasets. The goal of this project is to train strong models for the MasakhaNER corpus, which is a high-quality dataset for named entity recognition in 10 African languages.
Model(s)
There are a few popular multilingual models that you can start with:
Datasets
Challenges
It is unlikely that all ten languages in MasakhaNER are represented in multilingual models like XLM-R or mBERT, so some decisions will need to be made on which subsets to focus on.
Desired project outcomes
- Create a Streamlit or Gradio app on Spaces that can take text from one or more of the languages in MasakhaNER and extract person names (PER), organizations (ORG), locations (LOC), and dates and times (DATE).
- Don’t forget to push all your models and datasets to the Hub so others can build on them!
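An app like the one described above will usually need to turn per-token model predictions into whole entities before displaying them. Here is a minimal sketch of that step in plain Python, assuming the model emits BIO-style tags such as `B-PER` / `I-PER` / `O` (check the label set on the dataset card, since this is an assumption). The function name `group_entities` and the example sentence are hypothetical and only for illustration.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (entity_text, entity_type) pairs.

    Assumes tags look like "B-PER", "I-PER", or "O" (an assumption about
    the label scheme, not something stated in this post).
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A "B-" tag, or an "I-" tag that does not continue the open
            # entity, starts a new entity; close the previous one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (outside any entity): close whatever was open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical example sentence with hand-written tags.
example_tokens = ["Chinua", "Achebe", "visited", "Lagos", "in", "1960"]
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
print(group_entities(example_tokens, example_tags))
```

The grouped pairs can then be passed to whatever display widget the Streamlit or Gradio app uses. If you rely on a library pipeline that already aggregates sub-word predictions, this step may not be needed.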
Additional resources
Discord channel
To chat and organise with other people interested in this project, head over to our Discord and:
- Follow the instructions on the #join-course channel
- Join the african-ner channel
Just make sure you comment here to indicate that you’ll be contributing to this project.
Team organization on the Hub
To join this team, make sure you join the following organisation on the Hub: