Hi,
Congratulations to HuggingFace Transformers for winning the Best Demo Paper Award at EMNLP 2020!
I’m now trying v4.0.0-rc-1 with great interest.
If you don’t mind, I’d like to ask about something that seems strange when running the Seq2SeqTrainer example.
I’m sorry if I’m mistaken or if the problem depends on my environment, but I’d be grateful if you could look it over.
What seems strange
- The number of data pairs is not correctly recognized.
- MLflow cannot handle the params (the value is too long).
I wasn’t sure whether I should split these into two topics, but in the end I decided on one.
If it would be better to split them into two, I will do so.
Environment
- transformers version: 4.0.0-rc-1 (latest commit: 5ced23dc845c76d5851e534234b47a5aa9180d40)
- Platform: Linux-4.15.0-123-generic-x86_64-with-glibc2.10
- Python version: 3.8.3
- PyTorch version (GPU?): 1.7.0 (True)
- Tensorflow version (GPU?): 2.3.1 (True)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Script and Parameters
I first noticed this strangeness when I used a different dataset from the one in the example.
To check whether my own modification was causing the problem, I followed the README of examples/seq2seq again.
Having checked https://github.com/huggingface/transformers/issues/8792, I used --evaluation_strategy epoch instead of --evaluate_during_training.
$ CUDA_VISIBLE_DEVICES=0 python finetune_trainer.py \
--data_dir $XSUM_DIR \
--learning_rate=3e-5 \
--fp16 \
--do_train --do_eval --do_predict \
--evaluation_strategy epoch \
--predict_with_generate \
--n_val 1000 \
--model_name_or_path facebook/bart-large \
--output_dir ./xsum_bart-large/ \
--save_total_limit 5 \
2>&1 | tee tmp.log
Log
[INFO|trainer.py:667] 2020-11-30 08:10:43,836 >> ***** Running training *****
[INFO|trainer.py:668] 2020-11-30 08:10:43,836 >> Num examples = 204016
[INFO|trainer.py:669] 2020-11-30 08:10:43,836 >> Num Epochs = 3
[INFO|trainer.py:670] 2020-11-30 08:10:43,836 >> Instantaneous batch size per device = 8
[INFO|trainer.py:671] 2020-11-30 08:10:43,836 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:672] 2020-11-30 08:10:43,836 >> Gradient Accumulation steps = 1
[INFO|trainer.py:673] 2020-11-30 08:10:43,836 >> Total optimization steps = 76506
...
mlflow.exceptions.MlflowException: Param value '{'summarization': {'length_penalty': 1.0, 'max_length': 128, 'min_length': 12, 'num_beams': 4}, 'summarization_cnn': {'length_penalty': 2.0, 'max_length': 142, 'min_length': 56, 'num_beams': 4}, 'summarization_xsum': {'length_penalty': 1.0, 'max_leng' had length 293, which exceeded length limit of 250
(Reference) Dataset length
$ cd $XSUM_DIR/
$ wc -l *
11333 test.source
11333 test.target
204017 train.source
204017 train.target
11327 val.source
11327 val.target
453354 total
Details
The number of examples shown
At first, I tried to use a dataset with 40,000 pairs for training, but Num examples = 39999 was shown.
Since I didn’t know why, I checked the example with the XSum dataset.
Counting the lines, the XSum train set used in the example seems to have 204017 pairs, but Num examples = 204016 is shown, as above.
I thought the dataset was supposed to start with the first line, but am I mistaken? For example, is the first line treated as a header?
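In case it helps, this is the kind of small check I am considering on my side to see whether a trailing newline or an empty last line explains the off-by-one. It is only a sketch: count_lines is my own hypothetical helper, the path is a placeholder for the file under $XSUM_DIR, and I am assuming the dataset files are read line by line.

from pathlib import Path

def count_lines(path):
    """Compare the raw newline count with the number of non-empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    # wc -l counts newline characters, so a file that ends with "\n"
    # produces a trailing empty string after split("\n").
    raw = text.split("\n")
    non_empty = [line for line in raw if line.strip()]
    return text.count("\n"), len(non_empty)

# Placeholder path; in my case this would be train.source under $XSUM_DIR.
newlines, non_empty = count_lines("xsum/train.source")
print(f"newline count (as in wc -l): {newlines}, non-empty lines: {non_empty}")

If the non-empty count comes out one lower than wc -l, that would at least explain the number shown in the log, though I am not sure whether that is the intended behavior.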
MLflow cannot handle the params in this case
As shown above, the length of the param value exceeds the limit that MLflow can handle.
Do I just need to change the settings of MLflow, or should I modify the param value before it is passed to MLflow?
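If modifying the value is the right direction, I wondered whether a workaround along the following lines would be acceptable. This is only a sketch of my own: log_params_truncated is a hypothetical helper, and the 250 limit is taken from the error message above rather than from any library constant.

import mlflow

MAX_PARAM_VAL_LENGTH = 250  # the limit quoted in the exception above

def log_params_truncated(params):
    """Log params to MLflow after truncating values longer than the limit.

    Whether truncating (instead of skipping or splitting) the value is
    acceptable probably depends on the use case.
    """
    safe_params = {
        key: str(value)[:MAX_PARAM_VAL_LENGTH] for key, value in params.items()
    }
    mlflow.log_params(safe_params)

Of course, if there is a proper setting on the MLflow side, I would prefer that instead.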
Thank you in advance.
yusukemori