Building a custom Squad 2.0 style dataset, is it worth it?

swayson · July 20, 2020, 3:04pm

I was wondering what the experts think and whether this is a sensible approach. The pre-trained SQuAD 2.0 models perform well in my custom domain, but I think the results can be greatly improved: the target domain is rather narrow, and its vocabulary differs from SQuAD's, although there is some overlap.

Do you think it is worth building a custom dataset of, say, 1000 observations, using the same methodology as SQuAD v2.0 but derived from data in the target domain?
Are 1000 observations enough for fine-tuning?

valhalla · July 20, 2020, 3:30pm

Hi @swayson, not an expert here, but fine-tuning on your domain should give better results. I can’t comment on whether 1000 examples will be enough; you’ll probably need to experiment.

Also have a look at these question generation models. You can try to create synthetic QA corpora with them. Synthetic QA corpora have been shown to improve results for QA.
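
Whether the QA pairs are hand-annotated or generated, they would need to end up in the SQuAD v2.0 JSON layout if the fine-tuning script expects that format. Below is a minimal sketch of that conversion, not a definitive implementation. The helper name `to_squad_v2`, the example sentences, and the title are made up for illustration; the field names (`version`, `data`, `paragraphs`, `qas`, `is_impossible`, `answers`, `answer_start`) follow the published SQuAD v2.0 files. It assumes each answer appears verbatim in its context.

```python
import json
import uuid


def to_squad_v2(title, examples):
    """Build a SQuAD v2.0-style dict from (context, question, answer) triples.

    Hypothetical helper for illustration. Pass answer=None for an
    unanswerable question, which is what distinguishes v2.0 from v1.1.
    """
    paragraphs = {}  # context string -> list of QA entries for that context
    for context, question, answer in examples:
        qa = {"id": uuid.uuid4().hex, "question": question}
        if answer is None:
            qa["is_impossible"] = True
            qa["answers"] = []
        else:
            # Assumption: the first verbatim occurrence is the intended span.
            start = context.find(answer)
            if start == -1:
                raise ValueError(f"answer {answer!r} not found in context")
            qa["is_impossible"] = False
            qa["answers"] = [{"text": answer, "answer_start": start}]
        paragraphs.setdefault(context, []).append(qa)

    return {
        "version": "v2.0",
        "data": [
            {
                "title": title,
                "paragraphs": [
                    {"context": context, "qas": qas}
                    for context, qas in paragraphs.items()
                ],
            }
        ],
    }


# Made-up domain examples: one answerable, one unanswerable question.
context = "The pump must be serviced every 500 hours."
examples = [
    (context, "How often must the pump be serviced?", "every 500 hours"),
    (context, "Who manufactures the pump?", None),
]
dataset = to_squad_v2("maintenance-manual", examples)
print(json.dumps(dataset, indent=2))
```

Including some unanswerable questions matters here, because a SQuAD 2.0 model is also trained to abstain when the context does not contain the answer.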

swayson · July 20, 2020, 3:50pm

Thank you, @valhalla. I am going to give the synthetic QA models a shot and see if I can get some improvements.

valhalla · July 20, 2020, 4:06pm

Here’s a relevant paper. See table 2 for Synthetic QA results.
