Build a question answering system in your own language

:wave: Please read the topic category description to understand what this is all about


One of the major challenges with NLP today is the lack of systems for the thousands of non-English languages in the world. In this project, the goal is to build a question answering system in your own language. There are two main approaches you can take:

  • Find a SQuAD-like dataset in your language (these tend to only exist for a few languages unfortunately)
  • Find a dataset of question / answer pairs and build a search engine that returns the most likely answers to a given question


For the SQuAD-like task, any BERT-like model in your language would be a good starting point. If such a model doesn’t exist, consider one of the multilingual Transformers like mBERT or XLM-RoBERTa.
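As a concrete starting point for the extractive setting, the `question-answering` pipeline in :hugs: Transformers handles tokenization and answer-span extraction for you. A minimal sketch — the checkpoint here is just one example of a multilingual QA model, so swap in any QA checkpoint fine-tuned for your language:

```python
from transformers import pipeline

# Example checkpoint only -- any extractive-QA model for your language works here.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

result = qa(
    question="Where do I live?",
    context="My name is Sarah and I live in London.",
)
print(result["answer"], result["score"])
```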

For the search-like task, check out one of the many models under the sentence-transformers organization on the Hub.



This is a somewhat complex project because it involves training multilingual models on potentially low-resource languages.

Desired project outcomes

  • Create a Streamlit or Gradio app on :hugs: Spaces that allows users to obtain answers from a snippet of text in your own language, or returns the top-N documents that might contain the answer.
  • Don’t forget to push all your models and datasets to the Hub so others can build on them!

Additional resources

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

  • Follow the instructions on the #join-course channel

  • Join the #question-answering-de-fr channel

Just make sure you comment here to indicate that you’ll be contributing to this project :slight_smile:


Hi @lewtun would be glad to work on that project :slight_smile: Thanks


Hey @jpabbuehl happy to hear that you’d like to tackle this valuable project! Out of curiosity, which language did you have in mind?

Either French or German?

OK I think a German version of SQuAD v1 already exists in the xquad dataset and there’s also a custom QA dataset called germanquad.

For French there’s a custom QA dataset called fquad. So for these languages, it would be really cool to focus on fine-tuning a German / French model on one of these datasets, or using a multilingual model like XLM-RoBERTa that can answer questions in both at once.

What do you think about doing something like that?

This sounds like a really interesting project!
I would like to take a shot at this, in either Hungarian (my native language) or Romanian (which I also speak at a good level).

Hey @Endre cool to hear that you’re interested in this project!

Given that it might be time consuming to translate all of SQuAD into Hungarian or Romanian, it might make sense to first start by training a model on an existing dataset in one of those languages.

For example, there is the mqa dataset, which covers a different type of question answering called “community question answering”. It has subsets in both of your languages, so you can get a model trained and a Space up and running faster than by creating a dataset from scratch.

Community QA is more of a retrieval based approach and you can find an example of what it involves here with the haystack library (based on transformers).

Of course you’re welcome to create your own SQuAD dataset, but thought I should provide an alternative just in case :slight_smile:

Thanks for the tips @lewtun! Will give it a shot as you suggested


Cool! I’ve created a channel on Discord (see topic description) in case you and others want to chat there :slight_smile:

Thanks a lot for the pointer @lewtun!

I have taken a look and tried to make some progress, but it seems I’m a bit stuck on both approaches.

1. Using the mqa multilingual dataset approach

My main problem here is that this dataset has a totally different format than SQuAD, so I don’t know what type of model to fine-tune.

I’ve downloaded the relevant subsets and am investigating them. The CQA part looks like forum questions with multiple answers and appears to be of lower quality. The FAQ part contains single question/answer pairs, where the quality looks good; however, there is no context for these (as there is in SQuAD). So I cannot use this to fine-tune a ModelForQuestionAnswering, right?

I guess the difference in the dataset has to do with the fact you mentioned, that the mqa dataset is of a retrieval-based/community question answering type (in contrast with SQuAD, which is of an extractive question answering type). The haystack example you’ve linked talks about creating an embedding of the questions and then calculating a similarity with the incoming user questions. Is this the path to go forward?

2. Translating the SQuAD dataset approach

I don’t know what to do with the answer_start property

I’ve checked the resources linked for this approach as well, and while translating the context/question/answer automatically with some model seems feasible to me, there is one part of the dataset which seems problematic.
Every answer has an answer_start property, which, if my assumption is correct, marks the start of the tokenized answer in the tokenized context. Now, if the text is translated, the word order will change and the answer_start properties will be all wrong/offset. Would this prevent the correct training of a model? How could this be solved?

Hey @Endre,

Regarding (1), yes you’re right that these datasets aren’t in the SQuAD format, so what you’d want to do is either:

  • Use an existing pretrained model in Hungarian or Romanian (or a multilingual model) to generate embeddings for all the answers, and then compute the similarity between a query and all the answers. (See this nice description using sentence-transformers). This could allow someone to enter their question and then you return the top-N most likely answer documents.
  • Train your own sentence transformer on the Hungarian / Romanian subsets (see example here). This is more complex, so maybe it’s best to get a Space running with something like the above first :slight_smile:
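For the first option, the ranking step is just cosine similarity over the embeddings. Here’s a dependency-free sketch with toy vectors standing in for real model output (in practice the embeddings would come from a sentence-transformers model):

```python
import numpy as np

def top_n(query_emb, answer_embs, n=3):
    """Rank answer embeddings by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    a = answer_embs / np.linalg.norm(answer_embs, axis=1, keepdims=True)
    scores = a @ q                      # cosine similarity per answer
    order = np.argsort(-scores)[:n]     # indices of the n most similar answers
    return order, scores[order]

# Toy 3-dimensional "embeddings" stand in for real model output here.
answers = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])

idx, scores = top_n(query, answers, n=2)
print(idx, scores)
```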

Regarding (2), you’re totally right and this is an oversight on my part! I’ll re-word the project description to be more focused on training a QA system in one’s language - creating the dataset would indeed require a lot of human evaluation to update the character indices of the answers. Thank you for pointing this out to me!
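To make the issue concrete: answer_start is a character offset into the raw context string, and translation invalidates it. A toy sketch with made-up sentences:

```python
# answer_start is a character offset into the raw context string:
# context[answer_start : answer_start + len(text)] must equal the answer text.
context = "The Eiffel Tower is located in Paris, the capital of France."
answer = {"text": "Paris", "answer_start": 31}

span = context[answer["answer_start"] : answer["answer_start"] + len(answer["text"])]
assert span == answer["text"]

# After machine translation the offset no longer lines up, so it has to be
# recomputed, e.g. by searching for the translated answer in the translated
# context (which fails whenever the answer string itself changed in translation):
translated_context = "Der Eiffelturm befindet sich in Paris, der Hauptstadt Frankreichs."
translated_answer = "Paris"
new_start = translated_context.find(translated_answer)  # -1 if not found
print(new_start)
```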

@lewtun thanks for the tips, again!

Based on your pointers, I went ahead and created a kind of semantic search in Hungarian! In the end, as a dataset I used shortened abstracts from Wikipedia and calculated the embeddings using a pretrained multilingual sentence-transformer.

My space is up and running, and the returned results for input queries are more or less relevant! :slight_smile:

It ain’t much, but it’s honest work and it was an interesting project to research and execute!



Hey @Endre great job on building this search engine! I gave it a test and it indeed gives back relevant results :slight_smile:


Hello to everyone. I want to build a question answering model in the Greek language in order to help my students in mathematics. My problem is that for Greek I have only found a Greek masked language model which has been trained on Wikipedia, with no knowledge of mathematical terms. Can someone please tell me what steps I should follow in order to accomplish my goal? I have been trying for two years with no luck. The way I have thought about it is that I will have some texts with mathematical definitions and methodologies, and when a student asks, for example, “how can I solve a first order equation” or “I need help in equations”, the model will select the most relevant text.
Should I use the Greek masked language model I have found on Hugging Face, train it further on Greek mathematical text, and after that train it on the question answering task? The multilingual models I have found for the question answering task don’t seem to work very well on Greek mathematical text. Any help or guidance would save me a lot of time. Thanks in advance, and excuse me if my post is out of topic.