Very low GPU usage when translating text, datasets not helping

Hi! Hugging Face blew my mind, it’s awesome, but I’m struggling to get better performance out of my 1080 Ti: GPU utilization is very low, at about 3%, with CPU at around 30%.

At first I got the “UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset” warning so I switched to using a dataset. Now I get no warning but nothing seems to have changed. I guess I’m doing something wrong… here’s some sample code:

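import pandas as pd
from datasets import Dataset
from tqdm import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

to_translate = [...]  # placeholder: several Arabic sentences
dataset = Dataset.from_pandas(pd.DataFrame(to_translate))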

translator = pipeline(
    'translation_ar_to_en',
    model='Helsinki-NLP/opus-mt-ar-en',
    device=0
)

def trans(ds):
    ds['TRANSLATED'] = translator(ds['0'])
    return ds

for out in tqdm(translator(KeyDataset(dataset, "0"))):
    print(out)

I think I didn’t get the dataset part right yet…
Thanks!

Hi Iván! I think you need to pass the batch_size argument when you call your pipeline, so that it processes several text sequences at the same time. Something like this should work:

for out in tqdm(translator(KeyDataset(dataset, "0"), batch_size=32)):
    print(out)

You can experiment with different batch sizes to see which one gives you the best performance.
I hope this helps!
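
To compare batch sizes a bit more systematically, a small timing loop can help. The sketch below is only an illustration: `time_batch_sizes` and `fake_translator` are hypothetical names, and the fake translator simply sleeps once per batch to imitate a fixed per-call cost, so the numbers say nothing about a real GPU.

```python
import time


def time_batch_sizes(run, inputs, batch_sizes):
    """Return {batch_size: seconds} for one full pass of `run` over `inputs`."""
    timings = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        # list() forces evaluation in case `run` returns a lazy iterator
        outputs = list(run(inputs, batch_size=bs))
        timings[bs] = time.perf_counter() - start
        assert len(outputs) == len(inputs)
    return timings


def fake_translator(texts, batch_size=1):
    """Hypothetical stand-in for a pipeline: one fixed delay per batch."""
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        time.sleep(0.001)  # pretend each forward pass has a fixed overhead
        for text in chunk:
            yield text.upper()


sample = ["some sentence"] * 64
timings = time_batch_sizes(fake_translator, sample, [1, 8, 32])
best = min(timings, key=timings.get)
print(timings, "-> fastest batch size:", best)
```

With the real pipeline, `run` could be something like `lambda texts, batch_size: translator(texts, batch_size=batch_size)` on a small sample of your sentences. That is an assumption about how you would wire it up, and the best value will depend on your GPU memory and sentence lengths.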

Thank you! We’re now implementing on a CPU-only server, but I’ll keep your reply in our backlog 🙂

Hi everyone!
I have a similar problem, working with a TranslationPipeline:
I have a pandas DataFrame with a collection of German texts and their English translations. Now I want to back-translate the English column with t5-large (if there is a more recommended model, feel free to tell me 👍).

This is the code snippet:

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-large', model_max_length=1024)
model = AutoModelForSeq2SeqLM.from_pretrained('t5-large')

translator = pipeline(
    'translation_en_to_de',
    model=model, 
    tokenizer=tokenizer,
    device=0,
    batch_size=64,
)

# df is the DataFrame with columns ['de', 'en'];
# progress_apply requires tqdm.pandas() to have been called beforehand
df['backtranslation'] = df['en'].progress_apply(translator)

Of course, back-translating around 600’000 rows takes quite some time.
But I also get the “UserWarning: You seem to be using the pipelines sequentially on GPU.” warning, and I only reach about 3.5 it/s (with an RTX 2070 Super). So I think and hope I could do better here 🙂

I also tried putting the DataFrame into a dataset, but it got even slower with:

dataset.map(
    lambda row: {'backtranslation': translator(row['en'])},
    batched=True, 
    batch_size=64
)

Thanks for any hint/idea in advance!
Cheers
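
A note on the progress_apply version above: Series.apply calls the function once per row, so the pipeline only ever sees a single string at a time and batch_size=64 cannot take effect. That would be consistent with the "sequentially on GPU" warning. The sketch below illustrates the difference in call counts using a hypothetical stand-in (`CountingTranslator` and `translate_in_chunks` are made-up names, not part of transformers); it does not run a real model.

```python
class CountingTranslator:
    """Hypothetical stand-in for a pipeline: records the size of every call."""

    def __init__(self):
        self.calls = []

    def __call__(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        self.calls.append(len(texts))
        # Mimic the list-of-dicts shape a translation pipeline returns
        return [{"translation_text": t[::-1]} for t in texts]


def translate_in_chunks(translator, texts, chunk_size):
    """Call the translator once per chunk instead of once per row."""
    results = []
    for start in range(0, len(texts), chunk_size):
        results.extend(translator(texts[start:start + chunk_size]))
    return [r["translation_text"] for r in results]


rows = [f"sentence {i}" for i in range(200)]

# One call per row, which is what Series.apply does
per_row = CountingTranslator()
row_out = [per_row(text)[0]["translation_text"] for text in rows]

# One call per chunk of 64 rows
chunked = CountingTranslator()
chunk_out = translate_in_chunks(chunked, rows, chunk_size=64)

print(len(per_row.calls), len(chunked.calls))  # 200 4
```

With the real pipeline, the analogous change would presumably be to pass the whole column in one call, for example `translator(df['en'].tolist(), batch_size=64)`, or to iterate over a KeyDataset as in the first post, and then assign the results back to the DataFrame. Treat that as a suggestion to try, not a measured result.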