Hello, I’m currently running an NMT experiment using finetune.py from examples/seq2seq. After some research, I found the idea of leveraging pre-trained models instead of training from scratch. My model aims to translate pt_BR to es_ES, so my choice was to take advantage of https://huggingface.co/Helsinki-NLP/opus-mt-pt-ca , which seemed to cover closely related languages and domains. I’m using the OPUS pt_BR–es dataset with 55M sentence pairs (from a quick qualitative analysis I believe this data is of low/medium quality, but it is a very considerable amount). After the first run, and since this requires a huge amount of processing and costs money, some questions came to my mind. You can check my experiment on Weights & Biases: https://wandb.ai/jpmc/data/runs/3hahd13a?workspace=user-jpmc .
I’m seeing very inconsistent GPU usage, as you can see from the link. I have 8 Tesla K80s and tried several batch sizes that crashed early, while this one (32) crashed after 41 hours of training. Then, when I lowered it to 16, it crashed even earlier (check the system information in the other run from this experiment).
One epoch is estimated to take 60 hours; is this expected? I have no experience with this kind of problem using transformers.
The val_BLEU seems to drop, but not the val_loss. Is it possible that the BLEU value is not being logged correctly? What could cause this?
This is my first time using the library. Any advice on models and hyper-parameters I could tune? Should I train from scratch?
Hi, I’ve not tried seq-to-seq (I’ve been using BERT), and I’m not an expert, but I have a few suggestions.
I suggest you don’t train from scratch. Brazilian Portuguese should be very close to standard Portuguese, and Catalan (the model’s target) is probably fairly close to Spanish. Either way, the pre-trained weights should be much closer to what you need than randomly initialized ones.
I suggest you start by fine-tuning with a much smaller sample of the data, so that you can find out where the problems are and settle on suitable hyperparameters.
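If it helps, a rough way to carve out a small sample without loading the whole 55M-pair corpus into memory could look like the sketch below. I’m assuming the plain-text parallel train.source / train.target layout that the seq2seq examples use; the paths and keep probability are just placeholders for your setup.

```python
# Sketch: stream through the full parallel corpus and keep a random ~0.5% of the
# sentence pairs, writing them to a smaller data dir for quick experiments.
import random

random.seed(42)
KEEP_PROB = 0.005  # ~0.5% of 55M pairs is roughly 275k; tune to taste

with open("full/train.source", encoding="utf-8") as f_src, \
     open("full/train.target", encoding="utf-8") as f_tgt, \
     open("small/train.source", "w", encoding="utf-8") as out_src, \
     open("small/train.target", "w", encoding="utf-8") as out_tgt:
    for src, tgt in zip(f_src, f_tgt):
        if random.random() < KEEP_PROB:
            out_src.write(src)
            out_tgt.write(tgt)
```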
What do you suppose is happening at 17 hours and 35 hours? Is someone else sharing your system?
If you want to train a bit and then stop, and restart from the same place, you can save the model state-dict and the optimizer state-dict.
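Something along these lines should work for that; the `model`, `optimizer` and `step` names here are just placeholders for whatever finetune.py builds internally (a MarianMTModel and its optimizer), not the script’s own API:

```python
# Sketch: checkpoint model + optimizer so a crashed run can resume where it stopped.
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "step": step,
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["step"]  # resume training from this step
```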
Hello, and thanks for the reply! I will do the fine-tuning with a smaller sample of data. Another question came to mind: should I train a new tokenizer? How are new words handled when I use finetune.py on my dataset?
I’m the only one using the system.
Will save the model and optimizer state-dict, this is helpful.
Why run the validation less frequently? Is it to avoid making rushed decisions about the model’s performance?
I think that training a new tokenizer would require training from scratch. As I said, I’m not an expert, but I would imagine you don’t need to train from scratch. You could have a look at what the tokenizer is actually doing with the Brazilian Portuguese words (how it is splitting them), and see if that looks reasonable. Maybe compare it with how it splits some standard Portuguese words.
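As a rough sketch, something like this would let you eyeball the segmentation; the example words are just a few pt_BR / pt_PT pairs I picked for illustration:

```python
# Check how the opus-mt-pt-ca tokenizer splits Brazilian Portuguese words compared
# with their European Portuguese equivalents. Heavy subword fragmentation on
# pt_BR-specific vocabulary would suggest the tokenizer is a poor fit.
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-pt-ca")

# pt_BR word followed by its pt_PT counterpart: bus, cell phone, train
words = ["ônibus", "autocarro", "celular", "telemóvel", "trem", "comboio"]
for word in words:
    print(word, "->", tokenizer.tokenize(word))
```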
Running validation takes time and memory, so you could run it only when necessary.
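For what it’s worth, finetune.py is built on PyTorch Lightning, and Lightning’s Trainer can control how often (and on how much data) validation runs. Whether the script exposes these as command-line flags depends on your transformers version, so treat the argument names below as an assumption to check against the script rather than a recipe:

```python
# Sketch: run validation less often and on fewer batches in a Lightning Trainer.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,
    val_check_interval=0.25,  # validate 4 times per epoch instead of every epoch
    limit_val_batches=200,    # and only on a subset of the validation set
)
```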
Closing this here: the GPU usage is still inconsistent, but the training problem was mainly due to low-quality data. Thanks @rgwatwormhill for your attention! I will open an issue on the GitHub repo regarding the GPU usage.