I’m new to Transformers and the Hugging Face ecosystem in general.
I need some guidance with a project for my studies: building a single model that handles two document-processing tasks. The input is an image containing handwritten text, signatures, and stamps. The objectives are to (1) detect whether a signature and a stamp are present in the image and, if so, localize them with bounding boxes so they can be extracted, and (2) extract the handwritten text.
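To make the goal concrete, here is a rough sketch of the output I imagine the model producing for one image. Nothing here comes from an existing library: the names `Detection` and `DocumentResult`, the `(x_min, y_min, x_max, y_max)` pixel box convention, and the 0.5 score threshold are all my own assumptions, and the example values are made up by hand.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Detection:
    """One detected region (hypothetical structure)."""
    label: str    # "signature" or "stamp"
    box: Box      # where the region sits in the image
    score: float  # detector confidence in [0, 1]


@dataclass
class DocumentResult:
    """Combined output of both tasks for a single image (hypothetical)."""
    detections: List[Detection] = field(default_factory=list)  # task 1
    text: str = ""                                             # task 2

    def has(self, label: str, threshold: float = 0.5) -> bool:
        """True if at least one detection of `label` meets the threshold."""
        return any(
            d.label == label and d.score >= threshold for d in self.detections
        )

    def boxes(self, label: str) -> List[Box]:
        """Bounding boxes of all detections with the given label."""
        return [d.box for d in self.detections if d.label == label]


# Hand-written example values, not real model output.
example = DocumentResult(
    detections=[Detection("signature", (120, 640, 380, 720), 0.91)],
    text="Received with thanks",
)
```

With this structure, `example.has("stamp")` would answer the "is there a stamp?" part, and `example.boxes("signature")` would give the regions to crop out of the image.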
I thought model architectures like TrOCR and LayoutLM might help.
Any suggestions on how to build such a model, or any scientific papers or blog posts that might point me in the right direction?