Multi-lang non-OCR PDF text recognition

amendolajine · November 12, 2023, 11:17am

Hello everybody,

I’m a total beginner and I’m just starting to experiment with models.
I have a real-world task that I would like to solve:

I receive a multi-page PDF file (scanned, with no OCR text layer) that collects newspaper articles of interest. The articles are in several languages (German, Spanish, French, English, Dutch), and their layout varies as well (an example is attached).

I would like to set up a pipeline to:

1. Recognize the text of each article in the different languages
2. Store the recognized text of each article
3. Review the text to fix grammar and syntax
4. Translate the texts into Italian
5. Produce a .docx/.pdf file as output
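
To make the steps above concrete, here is a rough sketch of how they could be wired together in Python. It is only a skeleton under several assumptions that are not part of the task description: OCR is done with Tesseract through the pytesseract and pdf2image packages, the .docx is written with python-docx, and each PDF page is treated as one article (real newspaper layouts will probably need a layout-detection step instead). The names `Article`, `ocr_pdf`, `run_pipeline`, and `write_docx` are hypothetical, and the review and translation steps are passed in as plain functions so that any model can be plugged in later.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Tesseract language codes for German, Spanish, French, English, Dutch.
TESSERACT_LANGS = "deu+spa+fra+eng+nld"


@dataclass
class Article:
    page: int                # 1-based page number in the source PDF
    source_text: str         # raw OCR output (step 1, stored in step 2)
    reviewed_text: str = ""  # after grammar/syntax review (step 3)
    italian_text: str = ""   # after translation (step 4)


def ocr_pdf(pdf_path: str) -> List[str]:
    """Rasterize every page and OCR it; returns one string per page.

    Assumes pdf2image (needs Poppler) and pytesseract (needs Tesseract
    with the language packs above) are installed.
    """
    # Imported lazily so the rest of the sketch works without them.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=300)
    return [pytesseract.image_to_string(img, lang=TESSERACT_LANGS)
            for img in images]


def run_pipeline(page_texts: Iterable[str],
                 review: Callable[[str], str],
                 translate: Callable[[str], str]) -> List[Article]:
    """Steps 2-4: store, review, and translate each non-empty page.

    Simplifying assumption: one page equals one article.
    """
    articles = []
    for number, text in enumerate(page_texts, start=1):
        text = text.strip()
        if not text:
            continue  # skip blank pages
        reviewed = review(text)
        articles.append(Article(number, text, reviewed, translate(reviewed)))
    return articles


def write_docx(articles: List[Article], out_path: str) -> None:
    """Step 5: write the Italian text to a .docx (assumes python-docx)."""
    from docx import Document

    doc = Document()
    for article in articles:
        doc.add_heading(f"Page {article.page}", level=2)
        doc.add_paragraph(article.italian_text)
    doc.save(out_path)
```

With placeholder functions, the middle steps can be tried without any model at all, for example `run_pipeline(ocr_pdf("articles.pdf"), review=lambda t: t, translate=lambda t: t)`, where `articles.pdf` is a made-up file name; the two lambdas would later be replaced by calls to whatever review and translation models turn out to work.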

Which model do you think I could start with? Do you think the Colab environment would be sufficient for this kind of task?

Thank you!!
