Some unintended things happen in Seq2SeqTrainer example

Hi,

Congratulations to HuggingFace Transformers for winning the Best Demo Paper Award at EMNLP 2020!
I’m now trying v4.0.0-rc-1 with great interest.
If you don’t mind, I’d like to ask about something that seems strange when running the Seq2SeqTrainer example.

I’m sorry if I’m mistaken or if the problem depends on my environment, but I’d be grateful if you could look it over.

What seems strange

  • The number of data pairs is not correctly recognized.
  • MLflow cannot handle the params (they are too long).

I wasn’t sure whether I should split these into two topics, but in the end I decided on one.
If it would be better to split them in two, I will do so.

Environment

  • transformers version: 4.0.0-rc-1
    • The latest commit: commit 5ced23dc845c76d5851e534234b47a5aa9180d40
  • Platform: Linux-4.15.0-123-generic-x86_64-with-glibc2.10
  • Python version: 3.8.3
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): 2.3.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Script and Parameters

I first noticed this strangeness when I used a dataset different from those in the example.
I then followed the README of examples/seq2seq again to check whether my own modifications were causing the problem.

Having checked https://github.com/huggingface/transformers/issues/8792, I used --evaluation_strategy epoch instead of --evaluate_during_training.

$ CUDA_VISIBLE_DEVICES=0 python finetune_trainer.py \
    --data_dir $XSUM_DIR \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy epoch \
    --predict_with_generate \
    --n_val 1000 \
    --model_name_or_path facebook/bart-large \
    --output_dir ./xsum_bart-large/ \
    --save_total_limit 5 \
    2>&1 | tee tmp.log

Log

[INFO|trainer.py:667] 2020-11-30 08:10:43,836 >> ***** Running training *****
[INFO|trainer.py:668] 2020-11-30 08:10:43,836 >>   Num examples = 204016
[INFO|trainer.py:669] 2020-11-30 08:10:43,836 >>   Num Epochs = 3
[INFO|trainer.py:670] 2020-11-30 08:10:43,836 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:671] 2020-11-30 08:10:43,836 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:672] 2020-11-30 08:10:43,836 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:673] 2020-11-30 08:10:43,836 >>   Total optimization steps = 76506

...

mlflow.exceptions.MlflowException: Param value '{'summarization': {'length_penalty': 1.0, 'max_length': 128, 'min_length': 12, 'num_beams': 4}, 'summarization_cnn': {'length_penalty': 2.0, 'max_length': 142, 'min_length': 56, 'num_beams': 4}, 'summarization_xsum': {'length_penalty': 1.0, 'max_leng' had length 293, which exceeded length limit of 250

(Reference) Dataset length

$ cd $XSUM_DIR/
$ wc -l *
    11333 test.source
    11333 test.target
   204017 train.source
   204017 train.target
    11327 val.source
    11327 val.target
   453354 total

Details

The number of examples shown

At first, I tried to train on a dataset with 40,000 pairs, but the log showed Num examples = 39999.
I didn’t know why, so I checked the example with the XSum dataset.

Checking the line counts, the XSum train set used in the example appears to have 204017 pairs, but the log shows Num examples = 204016 as above.

I thought the dataset was supposed to start with the first line, but am I mistaken? For example, is the first line treated as a header?
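In case it helps to narrow this down, here is a small sanity check I could run on the raw files to see whether the mismatch comes from an empty last line or a missing trailing newline. This is only a guess at the cause, not a claim about how the dataset class works; the path just assumes the files in $XSUM_DIR as above.

# Quick sanity check: compare the `wc -l` style count (newline characters)
# with the number of non-empty lines, and inspect how the file ends.
from pathlib import Path

src_file = Path("train.source")  # run inside $XSUM_DIR

text = src_file.read_text(encoding="utf-8")
lines = text.split("\n")

print("newline count (what `wc -l` reports):", text.count("\n"))
print("elements after split:", len(lines))
print("non-empty lines:", sum(1 for line in lines if line.strip()))
print("file ends with a newline:", text.endswith("\n"))
print("last split element:", repr(lines[-1][:50]))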

MLflow cannot handle params in this case

As shown above, the length of one param value exceeds the limit that MLflow can handle.
Do I just need to change MLflow’s settings, or should the param value be modified before it is passed to MLflow?
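In case it is useful, below is a rough sketch of the kind of truncation I imagine could be applied before params are logged. This is only an idea, not how the Trainer’s MLflow integration actually works; the helper name is hypothetical, and the 250-character limit is taken from the error message above.

# Sketch of a hypothetical workaround: clip overly long param values
# before logging them to MLflow.
import mlflow

MAX_LEN = 250  # the limit reported in the MLflow error above

def log_params_truncated(params):
    # log_params_truncated is a made-up helper, not an existing API.
    # Convert every value to a string and truncate it to MAX_LEN characters.
    safe_params = {key: str(value)[:MAX_LEN] for key, value in params.items()}
    mlflow.log_params(safe_params)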

Thank you in advance.

yusukemori

hey there. It seems as if you have encountered some bugs with the trainer. Cool, that is very helpful! The forum may not be the best place to post this, though, as it serves more as a place for general questions. If you believe these are bugs, can you instead post this in the bug tracker on GitHub? You can include a link to this forum post as well.


Thank you for checking the post and telling me what I should do next.
I now understand that the forum is mainly for general questions.
From now on, I will post anything I believe to be a bug in the bug tracker on GitHub.

I’d like to post this topic there soon.


I created the issue.
Thank you for your advice about where to post!
https://github.com/huggingface/transformers/issues/8849
