TrOCR - inference on images in parallel

I use TrOCR for text recognition on administrative documents.
Each line of text is segmented and converted to an image for inference.
I implemented a process that runs inference on this batch of images in parallel to speed up execution:

    import ray
    import torch
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed').to(device)

    # one image path per segmented text line
    images = list(image_files)

    # one Ray task per image
    futures = [unitary_ocr_trocr.remote(img_path, processor, model, device) for img_path in images]
    results = ray.get(futures)  # blocks until all tasks have finished

where unitary_ocr_trocr() is:

    import cv2

    @ray.remote
    def unitary_ocr_trocr(image_path, processor, model, device):
        generated_text = ""
        try:
            # load the line image and convert it from BGR (OpenCV default) to RGB
            img = cv2.imread(image_path)
            image = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

            pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
            generated_ids = model.generate(pixel_values)
            generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        except Exception as e:
            print(str(e))

        return generated_text

The code runs on a GPU and freezes without any particular error message.
My questions:

  • Is it possible to implement such a solution?
  • Is the model thread-safe?
  • Is it possible to submit a batch of images to the processor instead of single images? I have seen that the processor has an images argument.

Thank you in advance for your help and support.

Hi,

Batched generation can be done by creating pixel values of shape (batch_size, num_channels, height, width), which are then passed to the model. In that case, you need to pass a list of images to the processor:

    from transformers import TrOCRProcessor, VisionEncoderDecoderModel
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
    model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed').to(device)

    # pass a list of images to be prepared for the model
    pixel_values = processor(images=[image1, image2], return_tensors="pt").pixel_values.to(device)

    # next, do batched generation
    generated_ids = model.generate(pixel_values)
    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

Thank you for taking the time to reply.
I'll have a go with this suggestion and report back.

Hi,
I have implemented the solution, and it works.
But the overall duration is more or less the same as the sum of the per-image inference times; I expected the inference to actually run in parallel.
On top of this, if I run the code in a loop, the second run fails after the first batch of images because the CUDA memory is not freed. Is there a specific command for this?
Thanks so much for your reply.
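
For reference, and not from the original thread: a common pattern for releasing per-batch GPU memory is to run generation under torch.no_grad(), drop the references to the intermediate tensors, and then call torch.cuda.empty_cache(). A minimal sketch, assuming batches is an iterable of lists of line images and that processor, model and device are defined as above:

    import torch

    results = []
    for batch in batches:
        with torch.no_grad():  # generation does not need autograd buffers
            pixel_values = processor(images=batch, return_tensors="pt").pixel_values.to(device)
            generated_ids = model.generate(pixel_values)
            results.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
        # drop the references to this iteration's GPU tensors, then release cached blocks
        del pixel_values, generated_ids
        if torch.cuda.is_available():
            torch.cuda.empty_cache()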

Just wanted to share some benchmarking results that might help others running into slow inference with TrOCR, using a batch size of 640 images in this case:

    # bf16 batch
    def run_inference_batch_bf_16(images, processor, model):
        # device is assumed to be defined globally, as in the earlier snippets
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            pixel_values = processor(images=images, return_tensors="pt").pixel_values.to(device)
            # batched generation
            generated_ids = model.generate(pixel_values)
            generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
        return generated_texts

    # simple batch
    def run_inference_batch(images, processor, model):
        pixel_values = processor(images=images, return_tensors="pt").pixel_values.to(device)
        # batched generation
        generated_ids = model.generate(pixel_values)
        generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
        return generated_texts

    # single-image inference
    def run_inference_single(images, processor, model):
        generated_texts = []
        for image in images:
            pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
            generated_ids = model.generate(pixel_values)
            generated_texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
        return generated_texts
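
The timing harness behind the numbers below is not shown in the post; here is a minimal sketch of how such a comparison could be measured, assuming images is the list of 640 line images (the helper name and its layout are illustrative):

    import time
    import torch

    def benchmark(fn, images, processor, model, n_runs=10):
        # hypothetical timing helper, not part of the original post
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # finish any pending GPU work before timing
        start = time.perf_counter()
        for _ in range(n_runs):
            fn(images, processor, model)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait until the last generation has completed
        total = time.perf_counter() - start
        print(f"Total time for {n_runs} runs: {total:.3f}s")
        print(f"Avg time per image: {total / (n_runs * len(images)):.4f}s")

    benchmark(run_inference_single, images, processor, model)
    benchmark(run_inference_batch, images, processor, model)
    benchmark(run_inference_batch_bf_16, images, processor, model)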

_____________________________________________________________________

Benchmarking single image inference:
Total time for 10 runs: 87.166s
Avg time per image: 0.0136s

Benchmarking regular batch inference:
Total time for 10 runs: 33.562s
Avg time per image: 0.0052s
Speedup over single image: 2.60x

Benchmarking bfloat16 batch inference:
Total time for 10 runs: 29.512s
Avg time per image: 0.0046s
Speedup over single image: 2.95x
Extra speedup with bfloat16 over regular batch: 1.14x
