Multi-language text recognition for non-OCR PDFs

Hello everybody,

I’m a total beginner and I’m just starting to experiment with models.
I have a real-world task that I would like to solve:

I receive a multi-page PDF file (non-OCR) that collects newspaper articles of interest. The articles are in several languages (German, Spanish, French, English, Dutch), and their layout varies too (an example is attached).

I would like to set up a pipeline to:

  1. Recognize the text of each article, whatever its language
  2. Store the recognized text of each article
  3. Perform a text review to fix grammar and syntax
  4. Translate the texts into Italian
  5. Produce a .docx/.pdf file as an output
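
To make the five steps above more concrete, here is a rough Python sketch of how they could fit together. It is only an illustration under several assumptions that are not part of my setup yet: OCR is done with Tesseract through `pytesseract`, pages are rasterized with `pdf2image`, the output is written with `python-docx`, and each page holds exactly one article (which is not true for my real files). `Article`, `ocr_pdf`, `run_pipeline`, `write_docx`, and the `review`/`translate` callables are hypothetical names; the last two stand in for whatever model ends up being used for steps 3 and 4.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Tesseract language codes for German, Spanish, French, English, Dutch.
TESSERACT_LANGS = "deu+spa+fra+eng+nld"


@dataclass
class Article:
    page: int                 # 1-based page number in the source PDF
    source_text: str          # steps 1-2: recognized text, stored as-is
    reviewed_text: str = ""   # step 3: after grammar/syntax review
    italian_text: str = ""    # step 4: Italian translation


def ocr_pdf(pdf_path: str) -> List[str]:
    """Step 1: rasterize every page and OCR it (assumes pdf2image + pytesseract)."""
    # Imported lazily so the rest of the sketch works without these packages.
    from pdf2image import convert_from_path  # requires Poppler on the system
    import pytesseract                       # requires the Tesseract binary

    images = convert_from_path(pdf_path, dpi=300)
    return [pytesseract.image_to_string(img, lang=TESSERACT_LANGS) for img in images]


def run_pipeline(
    page_texts: Iterable[str],
    review: Callable[[str], str],
    translate: Callable[[str], str],
) -> List[Article]:
    """Steps 2-4: store, review, and translate each page's text.

    Simplifying assumption: one article per page; blank pages are skipped.
    """
    articles = []
    for number, text in enumerate(page_texts, start=1):
        text = text.strip()
        if not text:
            continue
        reviewed = review(text)
        articles.append(Article(number, text, reviewed, translate(reviewed)))
    return articles


def write_docx(articles: Iterable[Article], out_path: str) -> None:
    """Step 5: write the Italian texts to a .docx file (assumes python-docx)."""
    from docx import Document

    doc = Document()
    for article in articles:
        doc.add_heading(f"Page {article.page}", level=2)
        doc.add_paragraph(article.italian_text)
    doc.save(out_path)
```

With placeholder functions for the two model steps, a run would look like `write_docx(run_pipeline(ocr_pdf("articles.pdf"), review=my_review, translate=my_translate), "out.docx")`, where the file names and both callables are made up for the example. Keeping the model steps as plain callables means the OCR and export parts stay the same whichever model is chosen.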

Which model do you think I could start with? Do you think the Colab environment would be fine for this kind of task?

Thank you!!