Training a model for a PDF with OCR - where to begin?

syedkhairi · May 22, 2023, 11:29am

Total newbie here when it comes to ML etc.

I have a PDF with pages that look like this, which I can export to JPEGs:

I want to train my model to be able to get the:

Question number
The question linked to the number
The number of marks linked to that question
Any diagrams linked to the question
Any answer spaces linked to the question
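
One way to pin down the five targets above before any labelling is to write them out as one record per question. This is a minimal sketch only: the field names (`number`, `text`, `marks`, `diagrams`, `answer_spaces`), the pixel bounding-box convention, and the example values are all assumptions made for illustration, not taken from any tool or from the actual PDF.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in page pixels; an assumed convention.
Box = Tuple[int, int, int, int]


@dataclass
class Question:
    number: str                                   # e.g. "3" or "3(a)"
    text: str                                     # the question wording
    marks: int                                    # marks available for the question
    diagrams: List[Box] = field(default_factory=list)       # diagram regions
    answer_spaces: List[Box] = field(default_factory=list)  # answer-space regions


def to_json(questions: List[Question]) -> str:
    """Serialise the questions found on one page into a JSON string."""
    return json.dumps([asdict(q) for q in questions], indent=2)


# Made-up example record, purely illustrative.
example = Question(
    number="1",
    text="Made-up question text.",
    marks=2,
    diagrams=[(40, 120, 300, 260)],
)
page_json = to_json([example])
print(page_json)
```

Having a fixed target like this makes it easier to decide which labels to create in a labelling tool and to check later whether a model's output is complete.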

I’m having a go at using Label Studio to label the areas, and then plan to train with TensorFlow. Are these the correct first steps?

Once I have labelled them, how do I know that the model also extracts, or takes into account, the actual text or content, and not just ‘what it looks like’?
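
On the text-versus-appearance point: a labelled box only records where something is on the page. The content normally comes from an OCR step, and OCR words can be matched to the labelled boxes afterwards. The following is a minimal sketch, assuming the OCR step yields words with pixel boxes and the labelling export yields named regions. The function names, region labels, and all coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed convention


def centre_inside(word_box: Box, region: Box) -> bool:
    """True if the centre point of word_box lies inside region."""
    cx = (word_box[0] + word_box[2]) / 2
    cy = (word_box[1] + word_box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


def text_per_region(words: List[Tuple[str, Box]],
                    regions: Dict[str, Box]) -> Dict[str, str]:
    """Collect the OCR words that fall inside each labelled region."""
    out = {}
    for label, region in regions.items():
        inside = [(box, text) for text, box in words if centre_inside(box, region)]
        # Naive reading order: top to bottom, then left to right.
        inside.sort(key=lambda item: (item[0][1], item[0][0]))
        out[label] = " ".join(text for _, text in inside)
    return out


# Made-up OCR output and labelled regions for a single line of a page.
words = [
    ("2", (20, 100, 30, 115)),
    ("State", (50, 100, 95, 115)),
    ("one", (100, 100, 130, 115)),
    ("use.", (135, 100, 170, 115)),
    ("[1]", (500, 100, 525, 115)),
]
regions = {
    "question_number": (10, 90, 40, 125),
    "question_text": (45, 90, 480, 125),
    "marks": (490, 90, 540, 125),
}
result = text_per_region(words, regions)
print(result)
```

A purely visual detector would stop at the boxes. Combining them with OCR text, as here, is what ties each region to its content.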

Extra: Once trained, how can I integrate my model with embeddings(?) so that I can use an LLM (GPT etc.) for querying, a chatbot, etc.?
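
For the "Extra" part, a common pattern is to embed the extracted question text, retrieve the closest records for a user query, and pass those records to the LLM as context. Below is a minimal sketch of the retrieval step only. A bag-of-words vector stands in for a real embedding model (an assumption, so the example runs without any downloads), and the records are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up records of the kind an extraction step might produce.
records = [
    {"number": "1", "marks": 2, "text": "Label the parts of the plant cell in the diagram."},
    {"number": "2", "marks": 3, "text": "Calculate the speed of the car from the graph."},
]
index = [(embed(r["text"]), r) for r in records]


def search(query: str, k: int = 1):
    """Return the k records whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [record for _, record in ranked[:k]]


top = search("which question asks about the speed of a car?")
print(top[0]["number"], top[0]["marks"])
```

Keeping the question number and marks next to the text in each record means a retrieved hit carries its structure with it, which plain text chunks would lose.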

Would appreciate any help, or hearing how you would approach this.

Many thanks in advance.

S

wsfung2008 · May 22, 2023, 1:06pm

You may want to consider using an existing library.

This article gives a nice intro to the topic.

Google also recently released a model, pix2struct.

syedkhairi · May 22, 2023, 2:50pm

This is massively helpful to get me started! I’ll take a look at Donut & pix2struct!

Thank you!

ThePie · December 13, 2023, 5:00am

Any updates on what was chosen? @syedkhairi @wsfung2008

prasantapanja · October 27, 2024, 10:26am

Hope you guys are doing well. Do you have any updates regarding this use case, @syedkhairi @wsfung2008? Interested to know how you finally cracked this, as simply storing and retrieving text embeddings doesn’t solve it.
