@valhalla I tried it out today on multi-GPU, and after fixing two bugs, here is a stream of consciousness from watching a Marian training run:
-
For Seq2SeqTrainer, `--do_eval` is mandatory. Otherwise, you get `ValueError: Trainer: evaluation requires an eval_dataset` at the end of epoch 1. If it's mandatory, the user should not need to specify it (a sketch of what I mean is below).
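A minimal sketch of the suggested behavior, not the current Trainer API (the helper name is made up): either infer `do_eval` when an `eval_dataset` is passed, or fail fast at construction time instead of at the end of epoch 1.

```python
def resolve_do_eval(do_eval: bool, eval_dataset) -> bool:
    """Hypothetical helper: decide whether to evaluate without forcing --do_eval."""
    if eval_dataset is not None:
        # an eval_dataset was passed, so evaluation is clearly intended
        return True
    if do_eval:
        # fail at init rather than after a full epoch of training
        raise ValueError("Trainer: evaluation requires an eval_dataset.")
    return False
```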
-
This may be more of an @sgugger question, but I don’t understand why I need to specify run name and output dir. Can we have run_name = output_dir if run_name is not specified?
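Something like this, sketched (the field names `run_name` and `output_dir` are from TrainingArguments; the helper itself is hypothetical):

```python
def default_run_name(run_name, output_dir):
    """Fall back to output_dir when no run_name is given."""
    return run_name if run_name is not None else output_dir
```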
-
This stuff should all be rounded (and maybe lr and loss should just be part of the prog bar); see the rounding sketch a few lines down:
{'loss': 6397.63612890625, 'learning_rate': 0.0003, 'epoch': 0.10484378276368211, 'total_flos': 2023859842449408, 'step': 500}
Is `total_flos` supposed to be total FLOPs? Why would I want to see that? Isn't it just `step * some constant`?
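The rounding could be as simple as a helper like this (hypothetical, not in the library), applied before the dict is printed or written:

```python
def round_metrics(metrics: dict, ndigits: int = 4) -> dict:
    """Round float values so log lines stay readable."""
    return {k: round(v, ndigits) if isinstance(v, float) else v
            for k, v in metrics.items()}

# round_metrics({'loss': 6397.63612890625, 'epoch': 0.10484378276368211, 'step': 500})
# -> {'loss': 6397.6361, 'epoch': 0.1048, 'step': 500}
```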
Would be dope if these eval logs
{'eval_loss': 6391.572265625, 'eval_bleu': 25.9505, 'eval_gen_len': 33.37937937937938, 'epoch': 1.2581253931641854, 'total_flos': 23975950948909056, 'step': 6000}
{'eval_loss': 6391.572265625, 'eval_bleu': 25.9505, 'eval_gen_len': 33.37937937937938, 'epoch': 1.2581253931641854, 'total_flos': 24064800096190464, 'step': 6000}
only got printed by rank 0 and also only got written to disk by rank 0.
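A minimal sketch of that guard, using plain `torch.distributed` rather than Trainer internals (the helper names are made up):

```python
import torch.distributed as dist


def is_main_process() -> bool:
    """True on single-GPU runs, and only on rank 0 in distributed runs."""
    return not dist.is_initialized() or dist.get_rank() == 0


# wrap both the print and the disk write:
# if is_main_process():
#     print(eval_metrics)
#     write_metrics_to_disk(eval_metrics)
```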
My progress bar says [screenshot omitted] after finishing 2/6 epochs.
"Iteration" does not clarify anything for me. I thought it was the whole job at the beginning. I wish that one were called `f"Epoch {n}"`, or at least said "Train".
I think I also have 2 progress bars, one from each proc.
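For the label and the duplicate bars, a sketch along these lines (hypothetical helper; tqdm's `disable` flag is real):

```python
import torch.distributed as dist
from tqdm.auto import tqdm


def epoch_bar(dataloader, epoch: int):
    """Wrap a dataloader in a clearly labeled bar that only rank 0 displays."""
    on_main = not dist.is_initialized() or dist.get_rank() == 0
    return tqdm(dataloader, desc=f"Epoch {epoch}", disable=not on_main)
```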
- log_history.json is replicated inside every checkpoint dir? why?
dbg_distributed_mar/checkpoint-6000/log_history.json
In the PL version you can always `cat output_dir/metrics.json` and see full, up-to-date metrics. You don't have to `ls` first to find your most recent checkpoint dir and then `cat log_history.json`.
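Roughly what I'm asking for, sketched (`metrics.json` is the name the PL scripts use; the helper itself is made up):

```python
import json
import os


def dump_log_history(log_history: list, output_dir: str) -> None:
    """Keep one up-to-date metrics file at the root of output_dir,
    instead of a copy inside every checkpoint-* directory."""
    with open(os.path.join(output_dir, "metrics.json"), "w") as f:
        json.dump(log_history, f, indent=2)
```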
- We should be rounding stuff in log_history.json
{
"loss": 4743.288, # YAY
"learning_rate": 0.00020929785871807638,
"epoch": 1.887188089746278,
"total_flos": 36036642601893888,
"step": 9000
},
{
"eval_loss": 6336.4580078125,
"eval_bleu": 26.2017,
"eval_gen_len": 33.48048048048048,
"epoch": 1.887188089746278,
"total_flos": 36036642601893888,
"step": 9000
I also think total_flos is distracting and useless, but whatever.