Chat with a PDF

Hi All,

I am a new forum member. Recently I have become interested in AI, machine learning and related topics. I have studied documents and tutorials around the web, and I am also following the Hugging Face course on the platform. I completed section 1 and started doing some experiments. At the moment, I consider myself an absolute beginner.

The first thing I attempted is a small chatbot for a PDF. Basically, you give a PDF to the chatbot, then you can start asking questions about it.

My project is here.

I took inspiration from similar projects on the web. However, all these projects use ChatGPT, and I don't want to do that because I don't have free credit anymore and I don't want to spend money on this kind of experiment.

The README.md contains the procedure to install and run it.

How does it work?

The main program is app.py. Here is how it works (a simplified sketch follows the list):

  1. First of all, I read the PDF. The GitHub project includes a PDF about Robinson Crusoe.
  2. I split it into 1000-character chunks.
  3. Then I convert the chunks into embeddings (my understanding is that models only operate on numeric tensors, not raw text) and store them in a ChromaDB using the all-MiniLM-L6-v2 model.
  4. There is a chatbot loop.
  5. The user asks a question, and I use it to retrieve the top-k documents related to it. I use them as context.
  6. Then these lines use the context and the question to generate a response with the google/flan-t5-large model.
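For reference, a minimal sketch of that pipeline could look roughly like this (a simplified illustration, not my exact app.py; the file name, collection name and top-k value are just examples):

```python
# Simplified sketch of steps 1-6 above: read the PDF, chunk it, embed the chunks into
# Chroma, then answer questions with flan-t5 using the retrieved chunks as context.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import chromadb

# 1. Read the PDF (file name is an example)
text = "".join(page.extract_text() or "" for page in PdfReader("robinson_crusoe.pdf").pages)

# 2. Split it into 1000-character chunks
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]

# 3. Embed the chunks with all-MiniLM-L6-v2 and store them in Chroma
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
collection = chromadb.Client().create_collection("pdf_chunks")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# 4-6. Chat loop: retrieve the top-k chunks and let flan-t5 answer from that context
generator = pipeline("text2text-generation", model="google/flan-t5-large")
while True:
    question = input("Ask Question about your PDF: ")
    results = collection.query(
        query_embeddings=embedder.encode([question]).tolist(), n_results=3
    )
    context = "\n".join(results["documents"][0])
    prompt = f"Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {question}"
    print(generator(prompt, max_new_tokens=200)[0]["generated_text"])
```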

The problem is that the chatbot is not very exciting. I tested the code and it seems the PDF is correctly loaded, correctly split into chunks, and probably correctly saved in ChromaDB (I don't know how to verify this). The real problem is that the quality of the answers is very poor. Here is an example of output:

Ask Question about your PDF: who is robinson crusoe?
a teenager
Ask Question about your PDF: can you tell me more about him?
Yes
Ask Question about your PDF: please tell me more about him
He is a good student. He is clever. He improves quickly.
Ask Question about your PDF: can you tell me the member of his family?
Father
Ask Question about your PDF: can you tell me a bit of robinson crusoe story
Yes

As you can notice, the first answer is too short. The second answer is even worse. I would like more complete answers (something similar to ChatGPT). Moreover, it seems that it doesn't really generate an answer but simply reuses the text in the PDF. I would like the chatbot to produce an answer on its own, using only the top-k chunks as context to generate the response.

Since I am a beginner, I am not familiar with which model to use for text generation in my specific scenario. Can anyone help me figure out what's wrong with my code?

Another thing I don't understand is that, to use the HuggingFaceHub class, I needed to create a Hugging Face token. Can you explain why? How is this service billed, and what are the rate limits? I tried to understand it from the platform but had no luck.

Another improvement I want to make is to persist my Chroma database on disk, so that I can store my chunks and their embeddings there once and then reuse them N times by simply loading the content from there. I tried several code snippets and read the docs, but I didn't find a way to do it.
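Something along these lines is what I am aiming for (a sketch assuming a recent chromadb version with PersistentClient; the path and collection name are just examples):

```python
# Persist the Chroma collection on disk so it can be reused across runs.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")       # data survives restarts
collection = client.get_or_create_collection("pdf_chunks")   # reuse it on later runs
# call collection.add(...) only on the first run; afterwards just query() the stored data
```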

Thank you in advance for your help.


Thanks for the detailed explanation @sasadangelo. I have a couple of suggestions about what might improve the retrieved answers:

  • The embedding model used to convert the chunks into embeddings.
  • The number of chunks we generate also matters a lot.
  • The LLM used to generate answers from the retrieved context.

I'm happy to discuss this further and look into potential opportunities for improvement. Thanks!

Thanks for the explanation.

  • Try different embedding models.
  • Try a different LLM as well.
  • I would suggest you try FAISS rather than Chroma, since FAISS is very good at finding similar text.
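If it helps, a rough sketch of what the FAISS variant could look like (assuming faiss-cpu and sentence-transformers are installed, and that `chunks` is the list of PDF chunks from the splitting step):

```python
# Rough sketch: index the chunk embeddings with FAISS and retrieve the most similar ones.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_vectors = np.asarray(embedder.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(chunk_vectors.shape[1])    # exact L2 search over the embeddings
index.add(chunk_vectors)

query_vec = np.asarray(embedder.encode(["who is robinson crusoe?"]), dtype="float32")
distances, ids = index.search(query_vec, 3)           # top-3 most similar chunks
top_chunks = [chunks[i] for i in ids[0]]
```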

Hi All,

Thank you for your replies. I analyzed the issue a bit and found that:

  1. FAISS and ChromaDB are quite similar in this scenario; the extracted docs are good enough to produce a good answer.
  2. I noticed that before splitting the PDF into chunks it should be cleaned up in some way. There is a lot of rubbish (index, introductory pages, and so on), but I do not know how to define a generic "cleanup procedure" that would be valid for every PDF.
  3. The size of the chunks matters (see the splitter sketch after this list). If I set chunks of 1000 characters with 200 overlap, the extracted docs are quite good. If I set 200 with 50 overlap, they are very bad.
  4. The reason I tried 200/50 in point 3 is that I tried the gpt2 LLM, and if the chunk is larger than 200 it gives me the error "ValueError: Error raised by inference API: Input is too long for this model, shorten your input or use 'parameters': {'truncation': 'only_first'} to run the model only on the first part."
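For reference, the chunking settings from point 3 look roughly like this, assuming LangChain's text splitter (the 1000/200 values are the ones discussed above):

```python
# Split the cleaned PDF text into overlapping chunks before embedding.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=200,  # overlap avoids cutting sentences across chunk boundaries
)
chunks = splitter.split_text(clean_text)  # clean_text is the full PDF text
```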

Since I am quite inexperienced with LLMs, you told me:
"Try a different LLM as well."

Can you suggest one or two models I can try that accept 1000-character chunks and provide good answers in a human-like manner?

In general, I see that people on the web always use ChatGPT, and this has a cost. Moreover, not all organizations allow the use of ChatGPT (mine, for example, doesn't).

I made some small progress.
Suppose you have the chatbot in a Streamlit interface where you can upload the PDF. You can do two things there to improve the PDF quality (a rough sketch of the idea follows the list):

  1. insert in a text box the list of pages to exclude
  2. insert in a text area the list of lines to exclude from the PDF
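Roughly, the idea is something like this (a simplified sketch, not the actual repo code; the excluded pages and lines are just examples):

```python
# Drop unwanted pages and lines from the PDF before chunking.
from pypdf import PdfReader

EXCLUDED_PAGES = {0, 1, 2}                                # e.g. cover and table of contents
EXCLUDED_LINES = {"Penguin Readers", "Robinson Crusoe"}   # e.g. repeated headers/footers

reader = PdfReader("robinson_crusoe.pdf")
pages = [p.extract_text() or "" for i, p in enumerate(reader.pages) if i not in EXCLUDED_PAGES]

lines = []
for page in pages:
    for line in page.splitlines():
        line = " ".join(line.split())                     # remove the extra blank spaces
        if line and line not in EXCLUDED_LINES:
            lines.append(line)

clean_text = "\n".join(lines)                             # this is what gets split into chunks
```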

I simulated this with this code, just for demo purposes:

Here is the code that filters pages and unwanted text:

Line 47 removes the extra blank spaces.

I verified that the input chunks (before embedding) are really good.

I verified that FAISS and ChromaDB extract the same documents, so switching between them doesn't bring any improvement. I think they are only vector databases; what really matters is the model used for the embeddings (see line 59).

I don't know if changing it would bring improvements.

However, I analyzed different models for answer generation a bit. My first doubt was: should I use text-generation or text2text-generation models? After a bit of analysis, I think the second one is the option to choose.

I selected meta-llama/Llama-2-7b, which seems quite promising. I did some tests in the live chat and the results are amazing. Now I can use it in two ways:

  • locally
  • on the Hugging Face Hub

The second option seems easier because I only need to replace line 73 with line 74 (see my new code).
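For context, the Hub route is essentially one call of this kind (a sketch based on the HuggingFaceHub class from LangChain as I understand it; the token is read from an environment variable and the repo_id is the model I want to switch to):

```python
# Sketch: call the model through the Hugging Face Hub instead of loading it locally.
import os
from langchain.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="meta-llama/Llama-2-7b",                            # hosted model to query
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
    model_kwargs={"temperature": 0.5, "max_new_tokens": 256},
)
print(llm("Who is Robinson Crusoe?"))
```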

I requested authorization on the Meta website and I am waiting for approval on the Hugging Face Hub. Is there a way to accelerate the approval?

Just a new update:

I tried meta-llama/Llama-2-7b but had no luck. If I use this model I get the error:
“Error raised by inference API: meta-llama/Llama-2-7b does not appear to have a file named config.json”

Looking for a solution on the web, I found that I need to use meta-llama/Llama-2-7b-hf, but then I got:
“Model requires a Pro subscription; check out Hugging Face – Pricing to learn more”

It requires a Pro subscription (like ChatGPT). I am going around in circles. Any suggestions?

Good work with that, sasadangelo. We also made a chatpdf tool. Be sure to give us your feedback.

You can use a quantised Llama-2 model provided by TheBloke. For the GPU use case, you can choose a GPTQ model provided here. If you choose to run it on CPU only, or CPU + GPU, you can choose a GGUF quantised model here.

To use it, you can refer to the documentation in that link. I also suggest you follow the Llama-2 prompt template (also provided in the model card) for the best answer generation.
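To make this concrete, a rough sketch of running one of the GGUF files on CPU with llama-cpp-python might look like this (the model filename, context size and retrieved context are just placeholders):

```python
# Sketch: answer a question with a GGUF-quantised Llama-2 chat model via llama-cpp-python,
# using the Llama-2 prompt template with the retrieved chunks as context.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=4096)

context = "..."                         # top-k chunks retrieved from the vector store
question = "who is robinson crusoe?"

prompt = (
    "[INST] <<SYS>>\nAnswer the question using only the provided context.\n<</SYS>>\n\n"
    f"Context:\n{context}\n\nQuestion: {question} [/INST]"
)

output = llm(prompt, max_tokens=256)
print(output["choices"][0]["text"])
```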