TrOCR - inference on images in parallel

I use TrOCR for text recognition on administrative documents.
Each line of text is segmented and converted to an image for inference.
I implemented a process to run inference on this batch of images in parallel to speed up execution:

    import ray
    import torch
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # is_available must be called
    model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed').to(device)

    images = list(image_files)  # paths of the segmented line images

    futures = [unitary_ocr_trocr.remote(img_path, processor, model, device) for img_path in images]
    results = ray.get(futures)

where unitary_ocr_trocr() is:

import cv2
import ray

@ray.remote
def unitary_ocr_trocr(image_path, processor, model, device):
    generated_text = ""
    try:
        img = cv2.imread(image_path)
        image = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # TrOCR expects RGB, OpenCV loads BGR

        pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
        generated_ids = model.generate(pixel_values)
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    except Exception as e:
        print(str(e))

    return generated_text

The code runs on a GPU and freezes without any particular error message.
My questions:

  • Is it possible to implement such a solution? (I sketch an untested actor-based variant after this list.)
  • Is the model thread-safe?
  • Is it possible to submit a batch to the processor instead of individual images, since I have seen it has an images argument?

Thank you in advance for your help and support.
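
For reference, this is the kind of actor-based variant I was considering as an alternative (untested sketch; TrOCRWorker is just a name I made up, and num_gpus=1 assumes a single-GPU machine):

import ray
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

@ray.remote(num_gpus=1)
class TrOCRWorker:
    # load the model once inside the actor instead of shipping it to every task
    def __init__(self):
        self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        self.processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
        self.model = VisionEncoderDecoderModel.from_pretrained(
            'microsoft/trocr-large-printed').to(self.device)

    def ocr(self, image_path):
        image = Image.open(image_path).convert("RGB")
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values.to(self.device)
        generated_ids = self.model.generate(pixel_values)
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

worker = TrOCRWorker.remote()
texts = ray.get([worker.ocr.remote(p) for p in image_files])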

Hi,

Batched generation can be done by creating pixel values of shape (batch_size, num_channels, height, width), which are then passed to the model. In that case, you need to pass a list of images to the processor:

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')	
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed').to(device)

# pass a list of images to be prepared for the model
pixel_values = processor(images=[image1, image2], return_tensors="pt").pixel_values.to(device)

# next, do batched generation
generated_ids = model.generate(pixel_values)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
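
For a full pipeline, you could chunk the line images into mini-batches (a rough sketch, assuming image_files is the list of paths from your post and reusing the processor, model and device defined above; batch_size is a placeholder to tune against your GPU memory):

from PIL import Image

batch_size = 8  # placeholder: tune to your GPU memory
all_texts = []
for i in range(0, len(image_files), batch_size):
    batch_paths = image_files[i:i + batch_size]
    batch_images = [Image.open(p).convert("RGB") for p in batch_paths]
    pixel_values = processor(images=batch_images, return_tensors="pt").pixel_values.to(device)
    generated_ids = model.generate(pixel_values)
    all_texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))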

Thank you for taking the time to reply.
I'll give this suggestion a go and report back.

Hi,
I have implemented the solution, and it works OK.
However, the overall duration is roughly the same as the sum of the per-image inference times; I expected the inference to be parallelized.
On top of this, if I run the code in a loop, the second pass fails after the first batch of images because the CUDA memory is not freed. Is there a specific command for this?
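I have seen torch.cuda.empty_cache() mentioned elsewhere; is something along these lines the intended way to release memory between runs (untested sketch)?

import gc
import torch

# drop the Python references to the previous batch's tensors first
del pixel_values, generated_ids
gc.collect()
torch.cuda.empty_cache()  # release cached blocks back to the CUDA driver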
Thanks so much for your reply.