Multi-Task Model - OCR + Object Detection

Hello Everyone,

I’m new to Transformers and the HuggingFace ecosystem in general.

I need some guidance with a project for my studies: building a single model that handles two tasks related to document processing. The input is an image containing handwritten text, signatures, and stamps. The objectives are to (1) detect whether a signature and a stamp are present in the image and extract them by predicting bounding boxes around them, and (2) extract the handwritten text.
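
To make the expected output concrete, here is a rough sketch of what I would like the model to return for one image. All names here (`Detection`, `DocumentResult`, `has`) are hypothetical and made up for illustration; they do not come from any existing library, and the values in the example are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical output schema for one document image (illustration only).
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    label: str    # "signature" or "stamp"
    box: BBox     # bounding box around the detected object
    score: float  # detector confidence in [0, 1]


@dataclass
class DocumentResult:
    detections: List[Detection] = field(default_factory=list)  # task 1 output
    text: str = ""                                             # task 2 output

    def has(self, label: str, threshold: float = 0.5) -> bool:
        """Return True if at least one detection of `label` meets the threshold."""
        return any(d.label == label and d.score >= threshold for d in self.detections)


# Made-up example values, just to show the shape of the result.
example = DocumentResult(
    detections=[Detection("signature", (40.0, 610.0, 220.0, 690.0), 0.91)],
    text="Received with thanks",
)
print(example.has("signature"), example.has("stamp"))
```

So task 1 would fill `detections` and task 2 would fill `text`, ideally from a single forward pass over the same image.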

I thought model architectures like TrOCR and LayoutLM might help.

Any suggestions on how to build such a model, or any scientific papers or blogs that might point me in the right direction?

Many thanks,

Cheers!