If you want to treat a PDF as text, you can simply use a Python library to extract the text data, clean it up, and use it for fine-tuning.
On the other hand, if you want to treat PDFs as images that contain both text and layout, it becomes more complicated, and it is more in the realm of VLM or multimodal models than LLM. In this case, you can either convert the PDF to an image first, or use a more complicated method.
Also, if you want to have a chatbot accurately interpret PDFs, it is probably easier in the end to use a system called RAG. Find a method that seems to fit your use case. I think it’s a good idea to try out various finished products in Spaces first.