How do I fine-tune domain-specific word similarity models?

I am trying to build a chatbot-like application (inspired by ChatGPT). The bot should be able to answer questions about a software product based on that product's Help document.
I have tried to fine-tune the distilbert-base-uncased model from Hugging Face on fewer than 100 annotated question-answer pairs in SQuAD format, but the model is not performing well: the F1 score is about 0.3. Can anyone who has worked on the same problem suggest approaches or documentation for implementing this kind of question answering?
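
For reference, each annotated example follows roughly the layout below. This is only a minimal sketch, assuming the flattened record layout of the Hugging Face `squad` dataset (`id`, `title`, `context`, `question`, `answers`); the text, the ID, and the `answer_spans_are_valid` helper are made up for illustration and are not my real data or part of any library.

```python
# Hypothetical example record in SQuAD-style layout; all text is made up.
context = (
    "To reset your password, open Settings and choose Account. "
    "Then click Reset Password."
)
answer_text = "open Settings and choose Account"

record = {
    "id": "help-doc-0001",
    "title": "Account help",
    "context": context,
    "question": "How do I reset my password?",
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context.
        "answer_start": [context.index(answer_text)],
    },
}


def answer_spans_are_valid(rec):
    """Return True if every answer text appears at its stated character offset."""
    ctx = rec["context"]
    answers = rec["answers"]
    return all(
        ctx[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_spans_are_valid(record))
```

A check like this may be worth running over a small annotated set, since an `answer_start` offset that does not match its answer text gives the model a wrong training target.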