How to get a model on patent data for question answering

Dear list,

I want a question answering model for US patent text. For example, I want it to read a patent’s text and answer questions such as ‘What is the specific problem to solve in this text?’. I tried some general question answering models such as ‘distilbert-base-cased-distilled-squad’, but the answers were not satisfactory.

Now I am considering whether I can get a better model by fine-tuning on patent data. Is this the right approach, and if so, how can I fine-tune a model on patent data to get more satisfactory answers?

Thanks in advance.

You’re going to have to fine-tune, as you said. Luckily, fine-tuning on SQuAD-format data is pretty easy. See here:

In a Python notebook, load your data into a pandas DataFrame and export it so that it matches the schema of the SQuAD dataset (see https://huggingface.co/datasets/viewer/?dataset=squad ). In this case, you need five fields in your exported file: id, title, context, question and answers. The answers field holds the answer text together with answer_start, the character offset of that text within context.
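The conversion step above can be sketched roughly as follows. This is a minimal illustration and not a definitive recipe: the row keys (patent_id, text, answer), the sample row, the output file name train.json and the helper to_squad_record are all hypothetical, and the top-level "data" wrapper is an assumption about how the training script reads JSON files. In a notebook, the rows could come from df.to_dict("records").

```python
import json


def to_squad_record(row_id, title, context, question, answer_text):
    """Build one flat SQuAD-style record from a (hypothetical) patent row."""
    # SQuAD answers are extractive: the answer must be a literal span of the context.
    start = context.find(answer_text)
    if start == -1:
        raise ValueError(f"answer not found in context for id={row_id!r}")
    return {
        "id": str(row_id),
        "title": title,
        "context": context,
        "question": question,
        # Both values are lists so that one question can carry several answers.
        "answers": {"text": [answer_text], "answer_start": [start]},
    }


# Hypothetical, made-up row used only to show the shape of the data.
rows = [
    {
        "patent_id": "example-1",
        "title": "Example patent",
        "text": "Existing hinges wear out quickly. The invention adds a sealed bearing.",
        "question": "What is the specific problem to solve in this text?",
        "answer": "Existing hinges wear out quickly",
    },
]

records = [
    to_squad_record(r["patent_id"], r["title"], r["text"], r["question"], r["answer"])
    for r in rows
]

# Assumption: the training script looks for the records under a top-level "data" key.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump({"data": records}, f, ensure_ascii=False, indent=2)
```

Because the answer offset is computed with str.find, a row whose answer is not a literal substring of the patent text raises an error instead of silently producing a bad training example.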

Once you’ve formatted your data to the schema and exported the JSON/CSV locally, run the run_qa.py script (from the transformers question-answering examples) and pass the train and validation files like so:

python run_qa.py \
  --model_name_or_path bert-base-uncased \
  --train_file train-v1.1.json \
  --validation_file dev-v1.1.json \
  --do_train \
  --do_eval \
  --output_dir ./qa-output

And of course, pass any other (hyper)parameters your fine-tuning task needs, such as --learning_rate or --num_train_epochs.
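Before launching a long training run, it may be worth checking that the exported file really matches the five-field schema. The sketch below is one possible sanity check, not part of run_qa.py: the helper names find_schema_problems and check_file are hypothetical, and the "data" wrapper key is an assumption about how the file was exported.

```python
import json

REQUIRED_FIELDS = ("id", "title", "context", "question", "answers")


def find_schema_problems(records):
    """Return a list of readable problems; an empty list means nothing was flagged."""
    problems = []
    for i, rec in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if name not in rec]
        if missing:
            problems.append(f"record {i}: missing fields {missing}")
            continue
        texts = rec["answers"].get("text", [])
        starts = rec["answers"].get("answer_start", [])
        if len(texts) != len(starts):
            problems.append(f"record {i}: text and answer_start differ in length")
            continue
        for text, start in zip(texts, starts):
            # The offset must point at the exact answer span inside the context.
            if rec["context"][start:start + len(text)] != text:
                problems.append(f"record {i}: answer {text!r} is not at offset {start}")
    return problems


def check_file(path, field="data"):
    """Load a JSON export and check the records stored under `field` (assumed key)."""
    with open(path, encoding="utf-8") as f:
        return find_schema_problems(json.load(f)[field])
```

Calling check_file on the training and validation files and printing the returned list gives a quick view of any rows that need fixing before they reach the trainer.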