Build a question answering system in your own language

Hey @Endre,

Regarding (1), yes you’re right that these datasets aren’t in the SQuAD format, so what you’d want to do is either:

  • Use an existing pretrained model in Hungarian or Romanian (or a multilingual model) to generate embeddings for all the answers, and then compute the similarity between a query and all the answers. (See this nice description using sentence-transformers). This could allow someone to enter their question and then you return the top-N most likely answer documents.
  • Train your own sentence transformer on the Hungarian / Romanian subsets (see example here). This is more complex, so maybe it’s best to get a Space running with something like the above first :slight_smile:

Regarding (2), you’re totally right and this is an oversight on my part! I’ll re-word the project description to be more focused on training a QA system in one’s language - creating the dataset would indeed require a lot of human evaluation to update the character indices of the answers. Thank you for pointing this out to me!