Please read the topic category description to understand what this is all about
Description
One of the major challenges with NLP today is the lack of systems for the thousands of non-English languages in the world. In this project, the goal is to build a question answering system in your own language. There are two main approaches you can take:
1. Find a SQuAD-like dataset in your language (unfortunately these tend to exist for only a few languages)
2. Find a dataset of question / answer pairs and build a search engine that returns the most likely answers to a given question
Model(s)
For the SQuAD-like task, any BERT-like model in your language would be a good starting point. If such a model doesn’t exist, consider one of the multilingual Transformers like mBERT or XLM-RoBERTa.
This is a somewhat complex project because it involves training multilingual models on potentially low-resource languages.
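Under the hood, an extractive QA head reduces answering to picking a start token and an end token in the context. A minimal sketch of that span-selection step, with toy hand-written logits standing in for a real model's output (all tokens and numbers below are made up for illustration):

```python
import numpy as np

# Toy example of how an extractive QA head picks an answer span.
# A real model (e.g. mBERT or XLM-RoBERTa with a QA head) produces one
# start logit and one end logit per token; here we hard-code tiny arrays.
tokens = ["Paris", "is", "the", "capital", "of", "France", "."]
start_logits = np.array([0.1, 0.0, 0.2, 0.1, 0.0, 3.0, 0.0])
end_logits = np.array([0.0, 0.1, 0.0, 0.2, 0.1, 3.5, 0.0])

best_score, best_span = -np.inf, (0, 0)
for i in range(len(tokens)):
    # Only consider spans that end at or after where they start,
    # and cap the answer length like real QA pipelines do.
    for j in range(i, min(i + 30, len(tokens))):
        score = start_logits[i] + end_logits[j]
        if score > best_score:
            best_score, best_span = score, (i, j)

answer = " ".join(tokens[best_span[0] : best_span[1] + 1])
print(answer)  # -> France
```

Fine-tuning on a SQuAD-like dataset is what teaches the model to put high logits on the right tokens; the span search itself is this simple.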
Desired project outcomes
Create a Streamlit or Gradio app on Spaces that allows users to obtain answers from a snippet of text in your own language, or returns the top-N documents that might contain the answer.
Don’t forget to push all your models and datasets to the Hub so others can build on them!
OK I think a German version of SQuAD v1 already exists in the xquad dataset, and there's also a custom QA dataset called germanquad.
For French there's a custom QA dataset called fquad. So for these languages, it would be really cool to focus on fine-tuning a German / French model on one of these datasets, or using a multilingual model like XLM-RoBERTa that can answer questions in both at once.
What do you think about doing something like that?
This sounds like a really interesting project!
I would like to take a shot at this, in either Hungarian (my native language) or Romanian (which I also speak at a good level).
Hey @Endre cool to hear that you’re interested in this project!
Given that it might be time consuming to translate all of SQuAD into Hungarian or Romanian, it might make sense to first start by training a model on an existing dataset in one of those languages.
For example, there is the mqa dataset which is a different type of question answering called “community question answering”. It has subsets in both your languages and this way you can get a model trained / Space up and running faster than creating the dataset from scratch.
Community QA is more of a retrieval based approach and you can find an example of what it involves here with the haystack library (based on transformers).
Of course you're welcome to create your own SQuAD dataset, but I thought I should provide an alternative just in case.
I have taken a look and tried to make some progress, but it seems I’m a bit stuck on both approaches.
1. Using mqa multilingual dataset approach
My main problem here is that this dataset has a totally different format than SQuAD, so I don’t know what type of model to fine-tune.
I've downloaded the relevant subsets and am investigating them. The CQA part looks like forum questions with multiple answers and seems to be of lower quality. The FAQ part contains single question / answer pairs, where the quality looks good; however, there is no context for these (as there is in SQuAD). So I cannot use this to fine-tune a ModelForQuestionAnswering, right?
I guess the difference in the dataset format has to do with the fact you mentioned: the mqa dataset is of a retrieval-based / community question answering type (in contrast with SQuAD, which is of an extractive question answering type). The haystack example you linked talks about creating an embedding of the questions and then calculating a similarity with the incoming user questions. Is this the path to go forward?
2. Translating SQuAD dataset approach
I don’t know what to do with the answer_start property
I've checked the resources linked for this approach as well, and while translating the context/question/answer automatically with some model seems feasible to me, there is one part of the dataset that seems problematic.
Every answer has an answer_start property which, if my assumption is correct, marks the start of the answer within the context. Now, if the text is translated, the word order will change and the answer_start properties will all be wrong/offset. Would this prevent the correct training of a model? How could this be solved?
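To make the problem concrete, here is the invariant a SQuAD-style example must satisfy, and the re-alignment step that translation would require (assuming answer_start is a character offset into the raw context; the Hungarian strings below are made-up examples):

```python
# Toy SQuAD-style example (made-up data), assuming answer_start is a
# character offset into the raw context string.
context = "Budapest is the capital of Hungary."
answer = {"text": "Budapest", "answer_start": 0}

# The invariant every SQuAD-style example must satisfy:
start = answer["answer_start"]
assert context[start : start + len(answer["text"])] == answer["text"]

# After machine translation the context changes, so the stored offset is
# stale. One possible repair is to re-locate the translated answer inside
# the translated context with str.find():
translated_context = "Magyarország fővárosa Budapest."
translated_answer = "Budapest"
new_start = translated_context.find(translated_answer)
if new_start == -1:
    # The answer was not preserved verbatim by translation, so the
    # example would have to be dropped or re-annotated by hand.
    new_start = None
```

Of course this only works when the translated answer appears verbatim in the translated context, which is far from guaranteed.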
Regarding (1), yes you’re right that these datasets aren’t in the SQuAD format, so what you’d want to do is either:
Use an existing pretrained model in Hungarian or Romanian (or a multilingual model) to generate embeddings for all the answers, and then compute the similarity between a query and all the answers. (See this nice description using sentence-transformers). This could allow someone to enter their question and then you return the top-N most likely answer documents.
Train your own sentence transformer on the Hungarian / Romanian subsets (see example here). This is more complex, so maybe it's best to get a Space up and running with something like the above first.
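The first option boils down to a handful of lines: embed every answer once, then rank by cosine similarity per query. A sketch of that loop; the bag-of-words embed() here is a deterministic stand-in so the example runs without downloads, and in practice you would swap it for SentenceTransformer(...).encode with a multilingual checkpoint (all strings below are made up):

```python
import numpy as np

# Candidate answers (made-up examples); in the real app these would be
# the mqa answers in Hungarian or Romanian.
answers = [
    "Budapest is the capital of Hungary.",
    "The Danube flows through Budapest.",
    "Bucharest is the capital of Romania.",
]

def tokenize(text: str) -> list[str]:
    return text.lower().replace(".", "").replace("?", "").split()

# Tiny bag-of-words "embedding" so the sketch runs offline.
# In practice: embed = SentenceTransformer("<multilingual model>").encode
vocab = {w: i for i, w in enumerate(sorted({w for a in answers for w in tokenize(a)}))}

def embed(text: str) -> np.ndarray:
    vec = np.zeros(len(vocab))
    for w in tokenize(text):
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

# Step 1: embed every answer once, up front, and L2-normalise.
answer_matrix = np.stack([embed(a) for a in answers])
answer_matrix /= np.linalg.norm(answer_matrix, axis=1, keepdims=True)

def top_n(question: str, n: int = 2) -> list[str]:
    # Step 2: embed the query and rank all answers by cosine similarity.
    q = embed(question)
    norm = np.linalg.norm(q)
    if norm == 0.0:  # query shares no words with the toy vocabulary
        return []
    scores = answer_matrix @ (q / norm)
    return [answers[i] for i in np.argsort(scores)[::-1][:n]]

print(top_n("What is the capital of Hungary?"))
```

With a real sentence-transformer the embeddings capture meaning rather than word overlap, but the embed-once / rank-by-cosine structure is identical.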
Regarding (2), you’re totally right and this is an oversight on my part! I’ll re-word the project description to be more focused on training a QA system in one’s language - creating the dataset would indeed require a lot of human evaluation to update the character indices of the answers. Thank you for pointing this out to me!
Based on your pointers, I went ahead and created a kind of semantic search in Hungarian! In the end, as a dataset I used shortened abstracts from Wikipedia and calculated the embeddings using a pretrained multilingual sentence-transformer.
Hello everyone. I want to build a question answering model in Greek in order to help my students in mathematics. My problem is that for Greek I have only found a Greek masked language model, which has been trained on Wikipedia and has no knowledge of mathematical terms. Can someone please tell me what steps I should follow in order to accomplish my goal? I have been trying for two years now with no luck. The way I have thought about it is that I will have some texts with mathematical definitions and methodologies, and when a student asks, for example, "how can I solve a first order equation" or "I need help with equations", the model will select the most relevant text.
Should I use the Greek masked language model I have found on Hugging Face, train it further on Greek mathematical text, and after that fine-tune it on the question answering task? The multilingual models I have found for question answering don't seem to work very well on Greek mathematical text. Any help or guidance would save me a lot of time. Thanks in advance, and excuse me if my post is off topic.
Great job @lewtun! I want to create question answering in my language, Tigrinya, from scratch, because most multilingual models don't include it. What approach can you suggest?