Please read the topic category description to understand what this is all about
One of the major challenges with NLP today is the lack of systems for the thousands of non-English languages in the world. In this project, the goal is to build a question answering system in your own language. There are two main approaches you can take:
- Find a SQuAD-like dataset in your language (these tend to only exist for a few languages unfortunately)
- Find a dataset of question / answer pairs and build a search engine that returns the most likely answers to a given question
For the SQuAD-like task, any BERT-like model in your language would be a good starting point. If such a model doesn’t exist, consider one of the multilingual Transformers like mBERT or XLM-RoBERTa.
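If you end up assembling or converting a SQuAD-like dataset yourself, it can help to check that every answer really sits at its stated character offset in the context, since misaligned offsets are easy to introduce with non-ASCII text. The sketch below assumes the usual SQuAD field layout (`context`, `question`, `answers` with `text` and `answer_start`); the helper name `check_squad_record` and the Swahili example record are made up for illustration.

```python
def check_squad_record(record):
    """Return a list of problems found in one SQuAD-style record."""
    problems = []
    context = record["context"]
    texts = record["answers"]["text"]
    starts = record["answers"]["answer_start"]
    if len(texts) != len(starts):
        problems.append("text and answer_start have different lengths")
    for text, start in zip(texts, starts):
        # The answer must be an exact substring of the context at `start`.
        if context[start:start + len(text)] != text:
            problems.append(f"answer {text!r} not found at offset {start}")
    return problems


# Hypothetical record; the offset is computed rather than typed by hand.
context = "Kiswahili ni lugha ya Kibantu inayozungumzwa Afrika Mashariki."
answer = "Afrika Mashariki"
example = {
    "id": "demo-0001",
    "title": "Kiswahili",
    "context": context,
    "question": "Kiswahili kinazungumzwa wapi?",
    "answers": {"text": [answer], "answer_start": [context.index(answer)]},
}

# Same record with a deliberately wrong offset, to show what a failure looks like.
broken = dict(example, answers={"text": [answer], "answer_start": [0]})

print(check_squad_record(example))
print(check_squad_record(broken))
```

Running a check like this over the whole dataset before fine-tuning is cheap and catches records whose offsets were computed in bytes instead of characters.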
For the search-like task, check out one of the many sentence-transformers models on the Hub: sentence-transformers (Sentence Transformers)
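To make the search-like approach concrete, here is a minimal sketch of the retrieval step: embed the incoming question, compare it against the stored questions with cosine similarity, and return the top-N answers. To keep it dependency-free, `embed` is a toy stand-in that counts character trigrams; in a real project you would replace it with embeddings from a sentence-transformers model. The function names and the sample question / answer pairs are assumptions for illustration only.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a sentence embedding: counts of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_answers(query, qa_pairs, n=3):
    """Rank stored (question, answer) pairs by similarity to the query."""
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(question)), question, answer)
        for question, answer in qa_pairs
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]


# Hypothetical question / answer pairs standing in for a real dataset.
qa_pairs = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "Eight"),
    ("Who wrote Don Quixote?", "Miguel de Cervantes"),
]

results = top_n_answers("capital city of France", qa_pairs, n=2)
for score, question, answer in results:
    print(f"{score:.2f}  {question}  ->  {answer}")
```

Character n-grams only capture surface overlap, so they will miss paraphrases; that gap is the reason to use a trained sentence embedding model for the real system.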
This is a somewhat complex project because it involves training multilingual models on potentially low-resource languages.
- Create a Streamlit or Gradio app on Spaces that allows users to obtain answers from a snippet of text in your own language, or returns the top-N documents that might contain the answer.
- Don’t forget to push all your models and datasets to the Hub so others can build on them!
To chat and organise with other people interested in this project, head over to our Discord and:
Follow the instructions on the
Just make sure you comment here to indicate that you’ll be contributing to this project