M2M100 training does not improve model performance

I’m trying to fine-tune M2M100 with the run_translation.py script, but the model does not seem to improve with more training.
This is the command I am using:

deepspeed examples/pytorch/translation/run_translation.py \
        --deepspeed tests/deepspeed/ds_config_zero3.json \
        --model_name_or_path facebook/m2m100_418M \
        --per_device_train_batch_size 8 \
        --per_device_eval_batch_size 8 \
        --output_dir output_dir --overwrite_output_dir \
        --fp16 \
        --do_train --do_eval --do_predict \
        --max_train_samples 500 --max_eval_samples 50 --max_predict_samples 50 \
        --num_train_epochs 0.001 \
        --dataset_name wmt16 --dataset_config "ro-en" \
        --source_lang en --target_lang ro \
        --predict_with_generate --forced_bos_token ro

Just to give an example (varying --num_train_epochs from the command above): training for 1 epoch gives about 20 BLEU on the test set, but training for 3 epochs gives only about 10 BLEU.
Am I doing anything wrong? Does M2M100 require any specific hyperparameter configuration?