Training a model for a PDF with OCR - where to begin?

Total newbie here when it comes to ML etc.

I have a PDF whose pages look like this, and I can export them to JPEGs:

I want to train a model to extract, for each question:

  1. Question number
  2. The question linked to the number
  3. The number of marks linked to that question
  4. Any diagrams linked to the question
  5. Any answer spaces linked to the question

I’m having a go at using Label Studio to label the areas, and then plan to train with TensorFlow. Are these the correct first steps?
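For orientation, here is a minimal sketch of turning Label Studio rectangle annotations into pixel boxes that a detection model could train on. It assumes the JSON export layout in which each result carries percentage-based `x`, `y`, `width`, `height` alongside `original_width` and `original_height`; the label names (`question_number`, `marks`), the file name, and the sample task are made up for illustration, so check them against your own export.

```python
import json

# Hypothetical sample shaped like a Label Studio JSON export (values invented).
SAMPLE_EXPORT = json.loads("""
[
  {
    "data": {"image": "page_001.jpg"},
    "annotations": [
      {
        "result": [
          {
            "type": "rectanglelabels",
            "original_width": 1000,
            "original_height": 2000,
            "value": {"x": 10.0, "y": 5.0, "width": 20.0, "height": 2.5,
                      "rectanglelabels": ["question_number"]}
          },
          {
            "type": "rectanglelabels",
            "original_width": 1000,
            "original_height": 2000,
            "value": {"x": 80.0, "y": 10.0, "width": 15.0, "height": 2.0,
                      "rectanglelabels": ["marks"]}
          }
        ]
      }
    ]
  }
]
""")


def to_pixel_boxes(tasks):
    """Convert percentage-based rectangles into (left, top, right, bottom) pixels."""
    boxes = []
    for task in tasks:
        image = task["data"]["image"]
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                if result.get("type") != "rectanglelabels":
                    continue  # skip anything that is not a labelled rectangle
                value = result["value"]
                img_w = result["original_width"]
                img_h = result["original_height"]
                left = round(value["x"] * img_w / 100)
                top = round(value["y"] * img_h / 100)
                right = left + round(value["width"] * img_w / 100)
                bottom = top + round(value["height"] * img_h / 100)
                boxes.append({
                    "image": image,
                    "label": value["rectanglelabels"][0],
                    "box": (left, top, right, bottom),
                })
    return boxes


if __name__ == "__main__":
    for box in to_pixel_boxes(SAMPLE_EXPORT):
        print(box)
```

Boxes in this form can then be written out in whatever format the chosen training framework expects.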

Once I’ve labelled them, how do I know that the model also extracts or takes into account the actual text content, and not just what each region looks like?
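One way to picture the distinction: a detector trained only on labelled boxes learns where regions are, while the text itself comes from a separate OCR pass, and the two are joined by geometry. The sketch below assumes OCR output is available as word boxes; the `Word` and `Region` types and all sample values are hypothetical and not tied to any particular OCR library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its pixel bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Region:
    """One labelled or predicted layout region in pixel coordinates."""
    label: str
    left: int
    top: int
    right: int
    bottom: int


def center_inside(word, region):
    """True if the centre point of the word box falls inside the region."""
    cx = (word.left + word.right) / 2
    cy = (word.top + word.bottom) / 2
    return region.left <= cx <= region.right and region.top <= cy <= region.bottom


def text_for_region(region, words):
    """Collect the words inside a region in a simple top-then-left order."""
    inside = [w for w in words if center_inside(w, region)]
    # Simplification: assumes words on one line share the same top coordinate.
    inside.sort(key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in inside)


# Invented sample data: one question line and a marks annotation.
WORDS = [
    Word("is", 190, 110, 210, 130),
    Word("What", 110, 110, 180, 130),
    Word("2+2?", 220, 110, 280, 130),
    Word("[2", 820, 300, 850, 320),
    Word("marks]", 860, 300, 930, 320),
]
QUESTION = Region("question_text", 100, 100, 900, 200)
MARKS = Region("marks", 800, 290, 950, 330)

if __name__ == "__main__":
    print(text_for_region(QUESTION, WORDS))
    print(text_for_region(MARKS, WORDS))
```

Models that read layout and text together exist as well, but the same idea applies: the text has to enter the pipeline explicitly, either through OCR or through a model that was trained to produce it.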

Extra: Once the model is trained, how can I turn its output into embeddings(?) so that I can use an LLM (GPT etc.) to query it or build a chatbot?
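A rough sketch of the retrieval step this usually implies, assuming the questions have already been extracted as plain text. A real setup would call an embedding model; here a toy bag-of-words vector stands in for it (the `embed` function is a placeholder, not a real embedding API) so the example runs with only the standard library. The sample questions are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Placeholder embedding: lower-cased word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented records in the shape the extraction step might produce.
QUESTIONS = [
    {"number": "1", "marks": 2,
     "text": "Calculate the area of the triangle shown in the diagram."},
    {"number": "2", "marks": 3,
     "text": "Explain why the reaction rate increases with temperature."},
    {"number": "3", "marks": 1,
     "text": "State the unit of electrical resistance."},
]


def search(query, records, top_k=1):
    """Rank records by similarity to the query and return the best matches."""
    query_vec = embed(query)
    ranked = sorted(records,
                    key=lambda r: cosine(query_vec, embed(r["text"])),
                    reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    hit = search("Which question asks about temperature and reaction rate?",
                 QUESTIONS)[0]
    print(hit["number"], hit["text"])
```

The matched records would then be placed into the LLM prompt as context, so the chatbot answers from the extracted questions rather than from the raw page images.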

I would appreciate any help, or to hear how you would approach this.

Many thanks in advance.

S

You may want to consider using an existing library.

This article gives a nice intro to the topic.

Google also recently released a model called Pix2Struct.

This is massively helpful to get me started! I’ll take a look at Donut & pix2struct!

Thank you!

Any updates on what was chosen? @syedkhairi @wsfung2008
