I’m a total beginner and I’m starting to experiment with the models.
I have this real-world task that I would like to solve:
I receive a multi-page PDF file (scanned, with no OCR text layer) that collects newspaper articles of interest. The articles are in several languages (German, Spanish, French, English, Dutch), and their layout varies too (an example is attached).
I would like to set up a pipeline to:
- Recognize the text of the article in the different languages
- Store the recognized text of each article
- Perform a text review to fix grammar and syntax
- Translate these texts into Italian
- Produce a .docx or .pdf file as output
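
To make the steps above concrete, this is roughly how I picture wiring them together. It is only a sketch under my own assumptions: `ocr`, `review`, and `translate` are placeholder callables I made up, standing in for whatever model or library ends up doing the work; I assume one article per page, which is not true for all of my files; and the last step only builds plain text instead of a real .docx/.pdf.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Article:
    source_page: int          # page of the PDF the article came from
    raw_text: str = ""        # text as recognized by the OCR step
    reviewed_text: str = ""   # text after the grammar/syntax review
    italian_text: str = ""    # final Italian translation


def run_pipeline(
    pages: Iterable[bytes],
    ocr: Callable[[bytes], str],        # hypothetical: page image -> text
    review: Callable[[str], str],       # hypothetical: text -> corrected text
    translate: Callable[[str], str],    # hypothetical: text -> Italian text
) -> List[Article]:
    """Run every page through OCR, review, and translation, keeping each stage."""
    articles = []
    # Assumption: each page holds exactly one article.
    for page_number, page_image in enumerate(pages, start=1):
        article = Article(source_page=page_number)
        article.raw_text = ocr(page_image)
        article.reviewed_text = review(article.raw_text)
        article.italian_text = translate(article.reviewed_text)
        articles.append(article)
    return articles


def render_report(articles: List[Article]) -> str:
    """Join the translated articles into one plain-text report."""
    sections = [f"Page {a.source_page}\n{a.italian_text}" for a in articles]
    return "\n\n".join(sections)
```

Keeping the intermediate text of every stage on the `Article` object would let me check where an error was introduced (OCR, review, or translation) before I worry about the output format.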
Which model do you think I could start with? Do you think the Colab environment would be suitable for this kind of task?