Dear list,
I want to build a question-answering model for US patent text. For example, I want it to read a patent’s text and answer questions such as ‘What is the specific problem to solve in this text?’. I tried some general question-answering models, such as ‘distilbert-base-cased-distilled-squad’, but the answers were not satisfactory.
Now I am considering whether I can get a better model by fine-tuning on patent data. Is this the right approach, and if so, how can I fine-tune a model on patent data to get more satisfactory answers?
Thanks in advance.
You’re going to have to fine-tune, as you said. Luckily, fine-tuning on SQuAD-format data is pretty easy. See here:
In a Python notebook, load your data into a pandas DataFrame and export it so that it matches the schema of the SQuAD dataset (see https://huggingface.co/datasets/viewer/?dataset=squad ). You need 5 fields in your exported file: id, title, context, question and answers. The answers field holds the answer text together with answer_start, the character offset of that answer within context.
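To make the schema concrete, here is a minimal sketch of building records with those 5 fields. The row contents, the helper names (to_squad_record, write_squad_json) and the output file name are made up for illustration. It uses plain dicts so it runs without pandas; with a DataFrame, the rows could come from df.to_dict("records"). Wrapping the records under a top-level "data" key is an assumption about how your copy of run_qa.py loads JSON files, so check the load_dataset call in the script before relying on it.

```python
import json


def to_squad_record(row_id, title, context, question, answer_text):
    """Build one record with the 5 SQuAD fields: id, title, context, question, answers."""
    # answer_start is the character offset of the answer inside the context.
    start = context.find(answer_text)
    if start == -1:
        raise ValueError(f"answer not found verbatim in context for id={row_id}")
    return {
        "id": str(row_id),
        "title": title,
        "context": context,
        "question": question,
        "answers": {"text": [answer_text], "answer_start": [start]},
    }


def write_squad_json(records, path):
    """Write the records to a JSON file."""
    # Assumption: the training script reads the list of records from a
    # top-level "data" key; adjust this if your script expects another layout.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"data": records}, f, ensure_ascii=False)


# Hypothetical annotated rows (invented example text, not real patent data).
rows = [
    {
        "id": 1,
        "title": "Example patent",
        "context": "Known hinges loosen over time. The invention provides a self-tightening hinge.",
        "question": "What is the specific problem to solve in this text?",
        "answer": "Known hinges loosen over time",
    },
]

records = [
    to_squad_record(r["id"], r["title"], r["context"], r["question"], r["answer"])
    for r in rows
]

if __name__ == "__main__":
    write_squad_json(records, "patent_train.json")  # hypothetical file name
```

Raising an error when the answer is not found verbatim is deliberate: extractive QA training needs the answer to be an exact span of the context, so it is better to catch mismatches while exporting than during training.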
Once you’ve formatted your data to the schema and exported the JSON/CSV locally, run the run_qa.py script and pass the train and test/validation files like so:
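# --do_train/--do_eval switch training and evaluation on; ./qa_output is a placeholder path
python run_qa.py \
  --model_name_or_path bert-base-uncased \
  --train_file train-v1.1.json \
  --validation_file dev-v1.1.json \
  --do_train \
  --do_eval \
  --output_dir ./qa_output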
And of course, pass any other (hyper)parameters you need for your fine-tuning task, such as --learning_rate, --num_train_epochs or --max_seq_length.
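Since the command needs separate train and validation files, here is one possible way to split your records. This is only a sketch: the function name split_by_title, the 20% validation fraction and the toy records are my own assumptions. It splits by title so that all questions about one patent land on the same side, which keeps the validation score from being inflated by contexts the model already saw in training.

```python
import random


def split_by_title(records, validation_fraction=0.2, seed=0):
    """Split SQuAD-style records into train/validation lists, grouped by title."""
    titles = sorted({r["title"] for r in records})
    random.Random(seed).shuffle(titles)  # seeded, so the split is reproducible
    # Always hold out at least one title for validation.
    n_val = max(1, int(len(titles) * validation_fraction))
    val_titles = set(titles[:n_val])
    train = [r for r in records if r["title"] not in val_titles]
    validation = [r for r in records if r["title"] in val_titles]
    return train, validation


# Toy records: 5 hypothetical patents with 2 questions each (other fields omitted).
toy_records = [
    {"id": f"{p}-{q}", "title": f"patent-{p}", "question": f"question {q}"}
    for p in range(5)
    for q in range(2)
]

train_records, validation_records = split_by_title(toy_records)
print(len(train_records), len(validation_records))
```

Each list can then be exported to its own file and passed as --train_file and --validation_file. Note that with a single title everything ends up in validation, so you need at least two distinct titles for this to be useful.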