Using MarianMT models in PyTorch is too slow for back translation (not parallelised correctly)

I’m trying to use Marian models for back translation as data augmentation. However, it is too slow, even with multiple GPUs, and I cannot use a batch size larger than 16 with the max length set to 300. It takes one day to complete half an epoch.
The following is the code I’m using:

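import numpy as np
import torch
import torch.nn as nn
from transformers import MarianMTModel, MarianTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One language code per list entry, so np.random.choice picks a single code
target_langs = ['fr', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'pt',
                'gl', 'lad', 'an', 'mwl', 'it', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la']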

def translate(texts, model, tokenizer, language="fr"):
    with torch.no_grad():
        template = lambda text: f"{text}" if language == "en" else f">>{language}<< {text}"
        src_texts = [template(text) for text in texts]
        encoded = tokenizer.prepare_seq2seq_batch(
            src_texts, max_length=300, return_tensors="pt"
        ).to(device)
        translated = model.module.generate(**encoded).to(device)
        translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)
        return translated_texts

def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from source to target language
    fr_texts = translate(texts, target_model, target_tokenizer, language=target_lang)

    # Translate from target language back to source language
    back_translated_texts = translate(fr_texts, en_model, en_tokenizer, language=source_lang)
    return back_translated_texts

target_model_name = 'Helsinki-NLP/opus-mt-en-de'
target_tokenizer = MarianTokenizer.from_pretrained(target_model_name)
target_model = MarianMTModel.from_pretrained(target_model_name)

en_model_name = 'Helsinki-NLP/opus-mt-de-en'
en_tokenizer = MarianTokenizer.from_pretrained(en_model_name)
en_model = MarianMTModel.from_pretrained(en_model_name)

target_model = nn.DataParallel(target_model)
target_model = target_model.to(device)  # same performance if I add .half()

en_model = nn.DataParallel(en_model)
en_model = en_model.to(device)  # same performance if I add .half()

## x1 and x2 are batches of strings. 
bk_x1 = back_translate(x1, source_lang="en", target_lang=np.random.choice(target_langs))   
bk_x2 = back_translate(x2, source_lang="en", target_lang=np.random.choice(target_langs))

Here is the GPU usage: utilization is low because of the small batch size of 16, but if I increase the batch size I get a CUDA out-of-memory error. I can also see that only one GPU is used for processing, so it might be that the Marian model cannot be parallelized correctly. If so, what would be the solution?

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce GTX 108...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 42%   78C    P2   199W / 250W |   9777MiB / 11178MiB |     91%      Default |
|                               |                      |                  N/A |
|   1  GeForce GTX 108...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 29%   36C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
|   2  GeForce GTX 108...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 31%   36C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
|   3  GeForce GTX 108...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 35%   41C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
|   4  GeForce GTX 108...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 29%   34C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
|   5  GeForce GTX 108...  Off  | 00000000:3F:00.0 Off |                  N/A |
| 30%   31C    P8     8W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
|   6  GeForce GTX 108...  Off  | 00000000:40:00.0 Off |                  N/A |
| 31%   38C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
|   7  GeForce GTX 108...  Off  | 00000000:41:00.0 Off |                  N/A |
| 30%   37C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A     58780      C   python                          10407MiB |
|    1   N/A  N/A     58780      C   python                              0MiB |
|    2   N/A  N/A     58780      C   python                              0MiB |
|    3   N/A  N/A     58780      C   python                              0MiB |
|    4   N/A  N/A     58780      C   python                              0MiB |
|    5   N/A  N/A     58780      C   python                              0MiB |
|    6   N/A  N/A     58780      C   python                              0MiB |
|    7   N/A  N/A     58780      C   python                              0MiB |

FYI: I’m using
pytorch 1.7.0
transformers 4.0.1
cuda 10.1

The problem might be related to the tokenizer rather than the model. The MarianTokenizer does not have a Rust (fast) implementation, which may cause a bottleneck no matter how many GPUs you use. It might be a good idea to preprocess (tokenize) your dataset once and use datasets for on-the-fly, superfast access to that cached dataset.
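A minimal sketch of that idea, assuming the datasets library is installed. The helper names add_language_prefix and tokenize_once are made up for illustration and are not part of either library; the prefix logic simply mirrors the template in the question.

```python
def add_language_prefix(texts, language="fr"):
    """Mirror the question's template: prepend >>lang<< unless the target is English."""
    if language == "en":
        return list(texts)
    return [f">>{language}<< {text}" for text in texts]


def tokenize_once(texts, tokenizer, language="fr", max_length=300):
    """Tokenize the whole corpus a single time instead of once per training step.

    Hypothetical helper: `tokenizer` is assumed to be a MarianTokenizer.
    """
    from datasets import Dataset  # imported lazily so the prefix helper works without it

    dataset = Dataset.from_dict({"text": add_language_prefix(texts, language)})

    def encode(batch):
        # Truncate here; padding can be applied later, per batch
        return tokenizer(batch["text"], truncation=True, max_length=max_length)

    # batched=True passes many rows at once to the (slow, pure-Python) tokenizer
    return dataset.map(encode, batched=True, remove_columns=["text"])
```

The returned dataset can be written out with save_to_disk and reloaded later, so the slow tokenizer runs only once. This only covers text that is known up front; inputs produced by another model at run time still have to be tokenized on the fly.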

Thanks, Bram, for your reply, but I already tried that and saw no performance increase. Still only one GPU is utilized, with the same slow performance. In addition, it can only be done for the forward translation, not for the backward translation.

cc @patrickvonplaten

This seems to be an issue with your parallelization code. There is a distributed evaluation script for seq2seq models here; you could try to modify it for back translation.
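One likely culprit, as far as I can tell: nn.DataParallel only splits a batch across GPUs when the wrapper itself is called, which runs forward. Calling model.module.generate(...) reaches past the wrapper to the original module, which lives on a single device. The toy below is pure Python and only imitates that behaviour; ToyModel and ToyDataParallel are invented stand-ins, not PyTorch classes.

```python
class ToyModel:
    """Stand-in for a seq2seq model; logs which device processed how many items."""

    def __init__(self, device, log):
        self.device = device
        self.log = log

    def forward(self, batch):
        self.log.append((self.device, len(batch)))
        return [text.upper() for text in batch]

    def generate(self, batch):
        # An ordinary method: it runs on whatever device this object lives on
        return self.forward(batch)


class ToyDataParallel:
    """Imitates the wrapper: only calling the wrapper itself scatters the batch."""

    def __init__(self, module, device_ids):
        self.module = module
        self.device_ids = device_ids

    def __call__(self, batch):
        n_devices = len(self.device_ids)
        chunk = -(-len(batch) // n_devices)  # ceiling division
        outputs = []
        for i, device in enumerate(self.device_ids):
            part = batch[i * chunk:(i + 1) * chunk]
            if part:
                replica = ToyModel(device, self.module.log)
                outputs.extend(replica.forward(part))
        return outputs


log = []
wrapped = ToyDataParallel(ToyModel("cuda:0", log), ["cuda:0", "cuda:1", "cuda:2", "cuda:3"])

wrapped(["a"] * 16)  # goes through the wrapper: work is spread out
devices_via_wrapper = sorted({device for device, _ in log})

log.clear()
wrapped.module.generate(["a"] * 16)  # bypasses the wrapper: one device only
devices_via_module = sorted({device for device, _ in log})
```

In the toy, the wrapped call touches all four devices while the .module.generate call touches only cuda:0, which matches the nvidia-smi picture in the question.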

Thanks for your reply. I checked the examples. Basically, they are use cases for distributed training (on different machines), but my problem is that the batch is not parallelized over multiple GPUs.

The script does distributed evaluation; it allows you to do generation on multiple GPUs. You can find the script here.

I already saw the example, but it uses distributed data parallel, which is for distributed training on different machines. In my case, I have multiple GPUs on a single machine, for which we usually use PyTorch's nn.DataParallel(), as I did in my code. The problem is that the pre-trained Marian model is not parallelized over the 8 GPUs, as shown in the nvidia-smi output.

That’s not correct. On a single machine with multiple GPUs you often want to opt for DDP, too, as it performs much better than DP. You can check the official PyTorch documentation for this if you’re interested, but if you can, I always advise using DDP for true parallelism (true multiprocessing rather than single-process multithreading with GIL slowdown).
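For generation only, the multiprocessing idea can be sketched without the DistributedDataParallel wrapper at all, since there are no gradients to synchronise: start one process per GPU, give each its own model copy and a slice of the texts, then stitch the outputs back together. This is a rough sketch under that assumption, not the evaluation script mentioned above; all function names and the shard_*.json file layout are invented for illustration.

```python
import json
import os


def shard_indices(n_items, rank, world_size):
    """Indices of the items handled by process `rank` (round-robin split)."""
    return list(range(rank, n_items, world_size))


def merge_shards(shard_results, n_items):
    """Put per-process outputs back into the original order."""
    world_size = len(shard_results)
    merged = [None] * n_items
    for rank, results in enumerate(shard_results):
        for idx, item in zip(shard_indices(n_items, rank, world_size), results):
            merged[idx] = item
    return merged


def translate_shard(rank, world_size, texts, model_name, out_dir, batch_size=16):
    """Runs in its own process: one model copy on GPU `rank`, one slice of the data."""
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = torch.device(f"cuda:{rank}")
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    my_texts = [texts[i] for i in shard_indices(len(texts), rank, world_size)]
    results = []
    with torch.no_grad():
        for start in range(0, len(my_texts), batch_size):
            batch = my_texts[start:start + batch_size]
            encoded = tokenizer(batch, return_tensors="pt", padding=True,
                                truncation=True, max_length=300).to(device)
            generated = model.generate(**encoded)
            results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))

    # Each process writes its own file; the parent merges them afterwards
    with open(os.path.join(out_dir, f"shard_{rank}.json"), "w") as f:
        json.dump(results, f)


def translate_on_all_gpus(texts, model_name, out_dir):
    """Spawn one worker per visible GPU and return translations in input order."""
    import torch
    import torch.multiprocessing as mp

    world_size = torch.cuda.device_count()
    os.makedirs(out_dir, exist_ok=True)
    # mp.spawn passes the process index as the first argument (our `rank`)
    mp.spawn(translate_shard, args=(world_size, texts, model_name, out_dir),
             nprocs=world_size, join=True)

    shard_results = []
    for rank in range(world_size):
        with open(os.path.join(out_dir, f"shard_{rank}.json")) as f:
            shard_results.append(json.load(f))
    return merge_shards(shard_results, len(texts))
```

Back translation would then be two calls, one with the forward model name and one with the reverse model name on the first call's output. When run as a script, the call to translate_on_all_gpus should sit under an `if __name__ == "__main__":` guard, because spawned processes re-import the main module.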

Also see: