Any multimodal LLMs that take direct PDF + text as input?

Does Hugging Face have any models that take direct PDFs as input, similar to VLMs that take an image as input?

Our organization is currently working on something that requires taking a prompt and a PDF page as input, then converting that page into HTML text. If the page contains an image, we need to tell the LLM in the prompt that the image should be described using the context provided and encapsulated in tags.

If it is a chapter heading, then it needs to add <chapter_heading> tags, and so on.

To do this, I need something that takes a literal PDF and a prompt as input. Do you think such a model exists on Hugging Face? If so, which category would I find it in?

I know one way I could do this: use a VLM that takes an image as input along with the prompt, where the image is an individual PDF page. What do you think?
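For what it's worth, the VLM-per-page route can be wired up with just a prompt builder like the one below. This is a minimal sketch of the tagging instructions described above; the page rendering library, the model, and the `<image_description>` tag name are all my assumptions, and the model call itself is only indicated in a comment.

```python
def build_page_prompt(page_number: int) -> str:
    """Prompt sent to a VLM alongside one rendered PDF page.

    Encodes the tagging rules from the post: images are described using
    surrounding context inside <image_description> tags (a hypothetical
    tag name), and chapter headings get <chapter_heading> tags.
    """
    return (
        f"Convert page {page_number} of this PDF into HTML. "
        "If the page contains an image, describe it using the surrounding "
        "context and wrap the description in <image_description> tags. "
        "Wrap any chapter heading in <chapter_heading> tags."
    )


prompt = build_page_prompt(1)
# With a VLM, each PDF page would first be rendered to an image (e.g. via
# pdf2image or PyMuPDF -- both assumptions) and passed together with the
# prompt, roughly:  outputs = vlm.generate(images=[page_image], text=prompt)
print(prompt)
```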

1 Like

It is extremely difficult to find the right model for you from HF’s model search screen. There are over a million repos, even if not all of them are usable models.

So if you are wondering whether you can find such a model on HF, first try searching Spaces with a simple keyword.
Someone may have already made it.
You can greatly shorten the process by reusing the code and models used in those Spaces.

1 Like

Aha: the next question on my list!!

What we need is a text representation of the PDF?
Perhaps it's some form of encoding? Then we train the model on PDFs (i.e. this encoding as input and the actual text as the expected output) … hence we can train for it!
So I would create a load of documents in MS Word, then save them as text files and also save them as PDF files :slight_smile:

Using the input/output method!

This would train the model, given the text representation of the PDF, to produce the text version of the PDF :slight_smile: and obviously vice versa!

Or
use a tool to convert a set of PDFs to some output format (maybe Markdown or HTML), then do the same thing!
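Either way, the training set is just (input, expected output) file pairs. A minimal sketch of collecting them, assuming the exported files land in one folder with matching names (the folder layout and naming are my assumptions):

```python
import pathlib


def collect_training_pairs(folder: str) -> list[tuple[pathlib.Path, pathlib.Path]]:
    """Pair each exported PDF with its same-named text file.

    Each (pdf, txt) pair is one training example: the PDF (or its
    encoded text representation) as input, the text file as the
    expected output.
    """
    root = pathlib.Path(folder)
    pairs = []
    for pdf in sorted(root.glob("*.pdf")):
        txt = pdf.with_suffix(".txt")  # or .md / .html for the converter route
        if txt.exists():
            pairs.append((pdf, txt))
    return pairs
```

PDFs without a matching target file are simply skipped, so partial exports don't poison the dataset.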

This way you can make a pathway for when the model outputs a PDF (text representation): you can use the pipeline to decode it!

So you could just be working with the text representation of the PDF (as this is the tokens).

I already did this with images in the language model :slight_smile:

So I can give the model a Base64 encoding of the image, i.e.

here is an image jhkfksjf_BASE64_hksjhfks
describe it?
or it can give me an image :slight_smile:
here is an image of your dog ljlksdflksdjfl_BASE64_kjhlsd9injs
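That Base64 trick can be sketched with nothing but the standard library. The `<img>…</img>` delimiters below are my placeholder for whatever marker the model is actually trained with:

```python
import base64


def embed_image_in_prompt(image_bytes: bytes, question: str) -> str:
    """Inline an image into a text prompt as Base64, as described above."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"here is an image <img>{b64}</img> {question}"


def extract_image_from_reply(reply: str) -> bytes:
    """Recover image bytes from a model reply that uses the same markers."""
    b64 = reply.split("<img>", 1)[1].split("</img>", 1)[0]
    return base64.b64decode(b64)
```

The same encode/decode round trip works for the spectrogram case: encode the spectrogram image into the prompt, and decode any Base64 the model emits back into bytes.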

or even :slight_smile:
What is this sound kjdbfjdhkf_BASE64_ (a spectrographic representation of a sound)?

And now it can also generate these spectrographic images, which can be converted back into a sound!

Voilà! There you go!

alonzi !

2 Likes