Seq2SeqTrainer Questions

@valhalla

Questions about Seq2SeqTrainer

  1. Does it know how to do a multi-GPU sortish sampler?
  2. Does it know how to sync metrics in a multi-GPU setting?
  3. Is TPU faster than GPU?

Experiments

  1. Would be interested to know how finetuning bart-large on xsum performs, for example, esp. on TPU.
  2. Would be cool to make a blog post showing some sort of optuna/ray search for summarization or translation finetuning. Would also be cool to know the best hparams.
    Good hparams to check are learning rate, gradient_accumulation_steps, label_smoothing, and dropout.
1 Like
  1. Yes, it uses DistributedSortishSampler for multi-GPU; it’s handled here.

  2. Yes, metrics are handled by Trainer and logged in the global main process. Pinging @sgugger for confirmation. TPU training is multi-device, so for example you see TPU logs here for distilbart-cnn-12-3.

(Note: as generate crashes/hangs on TPU, we don’t calculate BLEU/ROUGE metrics on TPU.)

  3. Yes, I’ve observed a 3-4x speed boost on a TPU v3-8 compared to a single V100:
  • distilbart-cnn-12-3, 1 epoch
    • on TPU v3-8: 25.5 min
    • on V100: ~102 min
  • marian-enro-6-3, 1 epoch
    • on TPU v3-8: ~5 min
    • on V100: 21 min

All experiments are logged in this wandb project if you want to compare for yourself.

  1. Coming soon :wink:
1 Like

Yes, metrics are properly accumulated and logged on the main process by Trainer.

2 Likes

Awesome, I will switch after I finish the distilbart paper! (PL cannot aggregate metrics to the main process at the moment, and TPU is broken.) Should we deprecate/delete the PL trainer at some point?

When we have something that works well for Seq2Seq, definitely.
FYI, I’m working full time on making Trainer better now that I’m mostly finished with the docs, so don’t hesitate to ask if you have any questions :slight_smile: I’m starting gently with the text classification examples but will make my way up to seq2seq.

2 Likes

@valhalla I tried to use it today on multi-GPU, and after fixing two bugs, here is a stream of consciousness from watching a Marian training run:

  1. For Seq2SeqTrainer,
    --do_eval is mandatory.
    Otherwise, you get `ValueError: Trainer: evaluation requires an eval_dataset.` at the end of epoch 1.
    If it’s mandatory, the user should not need to specify it.

  2. This may be more of an @sgugger question, but I don’t understand why I need to specify run name and output dir. Can we have run_name = output_dir if run_name is not specified?

  3. This stuff should all be rounded (and maybe lr and loss should just be part of the prog bar)

{'loss': 6397.63612890625, 'learning_rate': 0.0003, 'epoch': 0.10484378276368211, 'total_flos': 2023859842449408, 'step': 500}

Is total flos supposed to be total flops? Why would I want to see that?
Isn’t it just step * some constant?

Would be dope if these eval logs

{'eval_loss': 6391.572265625, 'eval_bleu': 25.9505, 'eval_gen_len': 33.37937937937938, 'epoch': 1.2581253931641854, 'total_flos': 23975950948909056, 'step': 6000}
{'eval_loss': 6391.572265625, 'eval_bleu': 25.9505, 'eval_gen_len': 33.37937937937938, 'epoch': 1.2581253931641854, 'total_flos': 24064800096190464, 'step': 6000}

only got printed by rank 0 and also got written to disk by rank 0.

  4. My progress bar says [screenshot of an “Iteration” progress bar] after finishing 2/6 epochs.
    Iteration does not clarify anything for me. I thought it was the whole job at the beginning. I wish that one would be called `f"Epoch {n}"` or just say “Train”.
    I think I also have 2 progress bars, one from each proc.

  5. log_history.json is replicated inside every checkpoint dir? Why?
    dbg_distributed_mar/checkpoint-6000/log_history.json

In the PL version you can always cat output_dir/metrics.json and see full, up-to-date metrics. You don’t have to ls first to find your most recent checkpoint dir and then cat log_history.json.

  6. We should be rounding stuff in log_history.json too (a sketch of one way to do this is at the end of this post):
  {
    "loss": 4743.288,  # YAY
    "learning_rate": 0.00020929785871807638,
    "epoch": 1.887188089746278,
    "total_flos": 36036642601893888,
    "step": 9000
  },
  {
    "eval_loss": 6336.4580078125,
    "eval_bleu": 26.2017,
    "eval_gen_len": 33.48048048048048,
    "epoch": 1.887188089746278,
    "total_flos": 36036642601893888,
    "step": 9000
  }
I also think total_flos is distracting and useless, but whatever.
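
Re: the rounding asks in 3 and 6, here is a minimal sketch of the kind of thing I mean. It assumes the Seq2SeqTrainer class from examples/seq2seq; the subclass name and the log override are just illustrative, not a proposal for the actual implementation:

from seq2seq_trainer import Seq2SeqTrainer  # the examples/seq2seq class; adjust the import to your layout

class RoundingSeq2SeqTrainer(Seq2SeqTrainer):
    """Round float metrics before they reach the console and log_history.json."""

    def log(self, logs, *args, **kwargs):
        # round every float to 4 decimals; leave ints (step, total_flos) alone
        rounded = {k: round(v, 4) if isinstance(v, float) else v for k, v in logs.items()}
        super().log(rounded, *args, **kwargs)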

1. --do_eval shouldn’t be mandatory. If we set --evaluate_during_training
without --do_eval then we see this issue. The bug is here in finetune_trainer.py: it only loads eval_dataset when --do_eval is set, but it should also check for
--evaluate_during_training or the new evaluation_strategy argument (rough sketch of the fix after this list).
@sgugger I think this needs to be handled in every example/ script, WDYT?

4. Progress bars in Trainer are a bit different than PL: it creates one progress bar for each epoch and each eval loop (the eval pbar is created for each device when on multi-GPU or TPU :exploding_head:), and another one to track epochs.

5. I’m also confused about why log_history is saved in every ckpt; @sgugger may know why.
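
For 1, a rough sketch of the check I have in mind. Treat this as pseudocode rather than the exact finetune_trainer.py code; the commented-out build_eval_dataset call just stands in for however the script actually constructs the eval set:

def should_build_eval_dataset(training_args):
    """True whenever evaluation will actually run, not only when --do_eval is passed."""
    strategy = getattr(training_args, "evaluation_strategy", "no")
    return (
        training_args.do_eval
        or getattr(training_args, "evaluate_during_training", False)  # old flag
        or getattr(strategy, "value", strategy) != "no"               # new argument (enum or plain string)
    )

# in main():
# eval_dataset = build_eval_dataset(data_args, tokenizer) if should_build_eval_dataset(training_args) else None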

  1. I didn’t write the part with do_train/do_eval, and those arguments are not used by Trainer, only by the scripts that use Trainer (I’m not at the scripts yet). Maybe we can add something in the __post_init__ of TrainingArguments so that if do_eval is not set, it defaults to evaluation_strategy != "no"? (Rough sketch after this list.)

  2. Yep, definitely possible.

  3. Logs are handled badly right now IMO; cleanup is scheduled for later (see 4).

  4. Progress bars are a mess in the Trainer right now; I usually run it with disable_tqdm=True for now. This is something I intend to clean up after a PR I’m preparing with state + callbacks (since then progress bars can be handled nicely in a callback; see the sketch after this list).

  5. log_history needs to be saved at each checkpoint because it will be reloaded from there if you resume training. (They are not all duplicates; each contains the history up until that point.)

  6. Rounding should be done by the user IMO. I have no idea why total_flos are in the logs; you should ask @teven :slight_smile:
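
For 1, a hand-written sketch of that default, shown as a subclass so it runs standalone; the real change would live in TrainingArguments.__post_init__ itself, and the getattr dance is only there to cope with evaluation_strategy being either a plain string or an enum:

from dataclasses import dataclass

from transformers import TrainingArguments

@dataclass
class MyTrainingArguments(TrainingArguments):
    def __post_init__(self):
        super().__post_init__()
        if not self.do_eval:
            # if --do_eval wasn't passed, infer it from the evaluation strategy
            strategy = getattr(self, "evaluation_strategy", "no")
            self.do_eval = getattr(strategy, "value", strategy) != "no"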
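
And for 4, the rough shape of a callback-based progress bar once that PR lands. TrainerCallback, on_step_end, state.max_steps etc. are how I imagine the interface here, not the merged API:

from tqdm.auto import tqdm
from transformers import TrainerCallback  # assumes the callback API from the state + callbacks PR

class ProgressBarCallback(TrainerCallback):
    """A single tqdm bar for the whole run, created only on the main process."""

    def __init__(self):
        self.pbar = None

    def on_train_begin(self, args, state, control, **kwargs):
        if state.is_local_process_zero:
            self.pbar = tqdm(total=state.max_steps, desc="Train")

    def on_step_end(self, args, state, control, **kwargs):
        if self.pbar is not None:
            self.pbar.update(1)

    def on_train_end(self, args, state, control, **kwargs):
        if self.pbar is not None:
            self.pbar.close()
            self.pbar = None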

1 Like

Hey @sshleifer! You’re right that flos are total operations; flops are operations per second, so I thought it’d be clear that flos are operations. If you’ve got a better name I’m all ears, since this hasn’t actually been that clear to everyone. You’re also right that, for a given model, it is just step * constant (at least nearly all the time). However, it is still interesting to compare models against each other, or different sizes of the same model, or the same model with different context sizes. Would you like it better if it was optional? I’ve been asked before in conference reviews for efficiency papers to compare floating-point operations, and it’s pretty integral to the calculator estimates, but I get that it might not be useful to everyone yet (although I’m working on integrating calculator with Trainer on the side and should be done next week).

I’m working on cleaning the Trainer state. If total_flos is not printed but kept in a field of Trainer.state, does that work for you? Does it need to be logged only at the end of training, maybe?

I don’t think it should be printed, as it is not really human-readable information; but it is useful, for example, to graph the loss as a function of the number of flos invested in training (not only the final sum), so I feel that it should be part of the logs.

It depends on what we call logs since logs are printed :wink:
Do you want it sent to the integrations like TensorBoard/Comet/Wandb? Just accessible inside of the Trainer for some custom post-processing?

All my processes in wandb are logging, so I get 8 runs :frowning: How do I modify my Trainer-only code?

details: Logging & Experiment tracking with W&B - #73 by brando


solution: Is wandb in Trainer configured for distributed training? - #3 by brando
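
For anyone landing here with the same problem, the linked thread has the actual answer; the general idea is to only initialize wandb on the main process, roughly like the sketch below. The env-var check assumes a torch.distributed launcher that sets RANK, and the project name is made up:

import os
import wandb

# Only rank 0 creates a wandb run; every other process disables wandb entirely,
# so a multi-process job shows up as a single run instead of 8.
if int(os.environ.get("RANK", 0)) == 0:
    wandb.init(project="my-seq2seq-runs")  # hypothetical project name
else:
    os.environ["WANDB_MODE"] = "disabled"  # "dryrun" on older wandb versions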