What's the best path to train a chatbot on a specific author's books?

Hello all :slight_smile:

I have a project where I would like to train a chatbot on one author’s books. The bot will only answer questions based on the author’s books, so there’s no need for it to know how many people live in each country, what they eat for breakfast, and so on.

How big a model is needed, and does the size affect how well it can communicate with the user?

Language might be a problem, but I’m not sure how much when the bot is specific to only these books. The author is Danish, a sort of philosopher, and I know from searching around that using English-trained models to speak a different language is generally a problem, because cultures differ in how they speak and how they use words.

Then of course there’s hardware. I have an i5-13600K, 64GB of RAM and an RTX 3060 12GB. From what I have read, this will limit me to maybe a 13B model.

Then there’s the training. I was advised to go with RAG, but after reading back and forth and watching one YouTube video after another, I’m still not sure.

The result I want is a super-search, know-it-all chatbot for everything the author has written :slight_smile:

All the best
Carsten, Denmark

Check this one: Finetuning Llama 2 and Mistral. A beginner’s guide to finetuning SOTA… | by Geronimo | Nov, 2023 | Medium

Sorry for the late reply, but I was needed elsewhere and this is a personal project.

What’s described in that article is creating the “voice” of a figure in the book.

What I’m after is a chatbot that relays the information in the books when asked about it. I have the books as PDFs, so it’s about getting them into the bot and keeping it from guessing, so that it answers only from the text it has been fed from these books.

Could you consider a semantic search model using sentence transformers?

https://www.sbert.net/examples/applications/semantic-search/README.html#question-answer-retrieval

Break your PDF into paragraphs and encode each paragraph. You can also consider using some sort of sliding window to encode overlapping sections of the PDF, in case an answer is split across paragraphs.
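A minimal sketch of that chunking step, assuming the PDF text has already been extracted to a plain string (for example with a PDF library of your choice) and that paragraphs are separated by blank lines. The function names `split_paragraphs` and `sliding_windows` are made up for illustration.

```python
import re


def split_paragraphs(text):
    """Split raw text on blank lines and collapse whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def sliding_windows(paragraphs, size=2):
    """Join every run of `size` consecutive paragraphs, moving one paragraph at a time.

    Neighbouring windows overlap, so an answer that straddles a paragraph
    break still appears whole in at least one window.
    """
    if not paragraphs:
        return []
    if len(paragraphs) <= size:
        return [" ".join(paragraphs)]
    return [
        " ".join(paragraphs[i:i + size])
        for i in range(len(paragraphs) - size + 1)
    ]


# Usage: encode both the single paragraphs and the overlapping windows.
sample = "First paragraph.\n\nSecond paragraph,\nwrapped over two lines.\n\nThird one."
paragraphs = split_paragraphs(sample)
chunks = paragraphs + sliding_windows(paragraphs, size=2)
```

Real PDF extractions often break lines and paragraphs in messier ways, so the splitting rule will probably need tuning for your books.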

Then encode the user’s question in the same embedding space and look for the closest match. Present that passage as the answer, or optionally paraphrase it with a separate model before presenting it.
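Here is a rough sketch of the matching step, ranking passages by cosine similarity to the question. The helper names (`cosine`, `top_matches`, `answer`) are hypothetical, and the model `paraphrase-multilingual-MiniLM-L12-v2` is only my assumption of a reasonable starting point for Danish text; you would want to compare a few models on your own books.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vec, passage_vecs, passages, k=3):
    """Return the k (score, passage) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, v), p) for v, p in zip(passage_vecs, passages)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def answer(question, passages, model_name="paraphrase-multilingual-MiniLM-L12-v2", k=3):
    """Embed the passages and the question with the same model, then rank."""
    # Imported here so the ranking helpers above work without the library.
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

    model = SentenceTransformer(model_name)
    # In practice, encode the passages once and cache the vectors on disk.
    passage_vecs = model.encode(passages)
    query_vec = model.encode(question)
    return top_matches(query_vec, passage_vecs, passages, k)
```

For a whole shelf of books, the library’s own `util.semantic_search` helper or a vector index will be faster than this plain-Python loop, but the idea is the same. Because the bot only returns passages that exist in the books, it cannot invent an answer; the optional paraphrasing model is the only place where guessing could creep back in.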