NLP pretrained model doesn’t use GPU when making inference

I am using a Marian MT pretrained model for inference on a machine translation task, integrated with a Flask service. I am running the model on a CUDA-enabled device. During inference the model does not use the GPU; it uses only the CPU. I don’t want to use the CPU for inference because it takes a very long time to process a request. Even a single sentence takes very long. Please help. Below are the code snippet and the model I am using:

model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer.prepare_translation_batch(src_text))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

I have downloaded the pytorch_model.bin and the other tokenizer files from an S3 environment and saved them locally. Please help me with how to move things onto the GPU for faster inference.

Have you tried model.to('cuda') to make the model use the GPU?

Hi Karthik,

Thanks for replying. Yes, I have tried that. Please find the snippet below and correct me where I am going wrong:

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(torch_device)

model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name).to(torch_device)
translated = model.generate(**tokenizer.prepare_translation_batch(src_text))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Thanks in advance

You have the model on the GPU, but how about the tokenizer output (the input tensors passed to generate)?

Change this line:
translated = model.generate(**tokenizer.prepare_translation_batch(src_text))

To:
translated = model.generate(**tokenizer.prepare_translation_batch(src_text).to('cuda'))
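To put the whole suggestion in one place, here is a minimal sketch that keeps the model and its input tensors on the same device. It assumes a recent transformers release, where the tokenizer is called directly instead of through prepare_translation_batch as in the snippets above; the helper names pick_device and translate are made up for illustration. The tokenizer itself is a plain Python object and stays on the CPU; only the tensors it returns are moved.

```python
def pick_device(cuda_available):
    """Return the device string to use for both the model and its inputs."""
    return "cuda" if cuda_available else "cpu"


def translate(src_text, model_name="Helsinki-NLP/opus-mt-ROMANCE-en"):
    """Translate a list of sentences, keeping model and inputs on one device."""
    # Imported here so the heavy libraries load only when translation runs.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = pick_device(torch.cuda.is_available())
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)

    # The tokenizer is not moved; only the tensors it returns go to the device.
    batch = tokenizer(src_text, return_tensors="pt", padding=True).to(device)
    translated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
```

Calling translate(["Bonjour le monde"]) would download the model on first use and return a list with one English string. In a long-running service you would probably want to load the tokenizer and model once at startup rather than inside the function.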

Sure Karthik, thanks, I will check. Should I also change the tokenizer line below,

tokenizer = MarianTokenizer.from_pretrained(model_name).to('cuda')

or only the line you mentioned?

translated = model.generate(**tokenizer.prepare_translation_batch(src_text).to('cuda'))

Do you get an error with tokenizer = MarianTokenizer.from_pretrained(model_name).to('cuda'), something like "'MarianTokenizer' object has no attribute 'to'"? If so, use the change I mentioned above instead.
Try it out.

Hi @Karthik12, I am now able to use the GPU, but inference is still very slow. Is there a way to make inference faster? I have made the changes you suggested, but it still takes a long time to process a single request.

Hi guys, I am having the same issue. Did you figure out what the problem is? Execution is very slow, although the model seems to be on a GPU. It used to behave differently on GPU, although I don’t recall exactly which transformers version I was using.

Hello, have you managed to solve the issue of slow translation?

Still getting the same speed. Have you managed to get faster inference?

Have you been able to improve inference speed on GPU for Marian?
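The thread does not settle why generation stays slow, so the following is only a sketch for narrowing it down, not a confirmed fix. It assumes two generic points: CUDA work is queued asynchronously, so a timer needs a synchronization call to include it, and sending several sentences per generate call usually uses the GPU better than sending one at a time. The helper names batched and timed are made up for illustration.

```python
import time


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def timed(fn, *args, sync=None, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds).

    `sync` is an optional zero-argument callable that is invoked before the
    clock is read at the start and at the end, so that pending asynchronous
    work is included in the measurement.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start
```

With PyTorch on a GPU, passing sync=torch.cuda.synchronize makes the measurement include queued GPU work. Timing the from_pretrained call and the generate call separately would show whether a request spends its time loading the model or generating, and the first CUDA call is typically slower than later ones, so it is worth timing a second request as well.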