How To Output "test_generations.txt" with run_seq2seq.py?

@stas or @sgugger would likely be able to answer this easily – thanks again for your comments on my previous query in December.

I was using the finetune_trainer.py script back in December, and found that running a script like this…

python3 -m torch.distributed.launch --nproc_per_node=8 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=1e-4  \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_test 100 \
    --fp16 \
    --sortish_sampler \
    --num_train_epochs 24 \
    --data_dir "/workspace/rabbit-py/corpii/short_name_sequential_source" \
    --model_name_or_path "google/pegasus-large" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --logging_steps 768 \
    --gradient_accumulation_steps 32 \
    --task 'summarization' \
    --max_target_length 12 \
    --val_max_target_length 12 \
    --test_max_target_length 12 \
    --overwrite_output_dir \
    --freeze_embeds \
    --adafactor \
    --run_name $RUN \
    "$@"

… would output checkpoint folders that looked like this:
[screenshot of the checkpoint folder contents]

The test_generations.txt file was exactly 100 lines long, so I assume it corresponded to the --n_test 100 argument. I can’t be sure, though: I struggled for a while to understand the difference between predict, eval, and test, and eventually gave up because the terminology was too confusing for me.

That said, the test_generations.txt file was generated and it was very useful.

I have now migrated to the new seq2seq script, run_seq2seq.py, from here: transformers/examples/seq2seq at master · huggingface/transformers · GitHub

I am successfully using this, with a script like this:

PREFIX=$(basename "$BASH_SOURCE")

python3 /workspace/fw-py/transformers/examples/seq2seq/run_seq2seq.py \
    --model_name_or_path '/workspace/fw-py/models_foreign/pegasus_large' \
    --do_train \
    --do_eval \
    --do_predict \
    --logging_steps 768 \
    --evaluation_strategy steps \
    --num_train_epochs 10 \
    --task summarization \
    --train_file "/workspace/fw-py/corpii/${PREFIX}/train.json" \
    --validation_file "/workspace/fw-py/corpii/${PREFIX}/val.json" \
    --test_file "/workspace/fw-py/corpii/${PREFIX}/test.json" \
    --output_dir "/workspace/fw-py/predictions/${PREFIX}" \
    --overwrite_output_dir \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --predict_with_generate \
    --text_column "question" \
    --summary_column "known_answer"
    

Again, I don’t understand the difference between do_eval and do_predict, and I am not sure what predict_with_generate really means. I can’t find it documented clearly anywhere, so I am just using all of them.

This script is working, and is generating checkpoint folders that look like this:
[screenshot of the checkpoint folder contents]

… which is a great start. However, I am missing the critical file I need in order to see what my model outputs: the test_generations.txt file.

Does anyone know if it is still possible to generate these test generations?

I did consult the --help output, and found …

--do_predict [DO_PREDICT] Whether to run predictions on the test set.
--predict_with_generate [PREDICT_WITH_GENERATE] Whether to use generate to calculate generative metrics (ROUGE, BLEU).

Although I don’t really understand what this means, it does seem like something that could help create test_generations.txt, but that does not seem to be happening in my case.
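
For anyone who lands here with the same confusion, here is my best guess at how these flags map onto the Trainer API, written as a minimal Python sketch. This is an assumption on my part, not official documentation, and the commented-out lines stand in for the model and datasets that run_seq2seq.py builds internally:

# Minimal sketch of my understanding of the three flags (assumption, not official docs).
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",
    do_eval=True,                # evaluate on the *validation* file during/after training
    do_predict=True,             # run trainer.predict() on the *test* file once training ends
    predict_with_generate=True,  # use model.generate() for eval/predict outputs, so that
                                 # text metrics like ROUGE/BLEU can be computed
)

# Inside the script (roughly):
# trainer = Seq2SeqTrainer(model=model, args=args, ...)
# trainer.evaluate()                       # <- triggered by --do_eval (validation metrics only)
# results = trainer.predict(test_dataset)  # <- triggered by --do_predict
# results.predictions are generated token ids when predict_with_generate=True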

Also, FYI, I am running the script again, trying just 3 epochs, and here is the first output from the console, which I think should show all of my arguments:

02/22/2021 19:45:54 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
02/22/2021 19:45:54 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/workspace/fw-py/predictions/translated_one', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Feb22_19-45-54_43a398359e63', logging_first_step=False, logging_steps=768, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=768, dataloader_num_workers=0, past_index=-1, run_name='/workspace/fw-py/predictions/translated_one', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard', 'wandb'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, sortish_sampler=False, predict_with_generate=True)

Thanks!

OK, it looks like I can now answer my own question. I can’t find any documentation of this, but it seems that the predict_with_generate flag WILL generate a text file with predictions, but not for each checkpoint; instead, it happens once at the end, after all the epochs are complete.

The relevant code: transformers/run_seq2seq.py at f991daed185261085301d72c2cd634836df1044a · huggingface/transformers · GitHub
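
As far as I can tell, that block boils down to something like the helper below. This is paraphrased from memory, so the exact output file name and variable names in run_seq2seq.py may differ, and write_test_generations is just a name I made up:

import os

def write_test_generations(trainer, tokenizer, test_dataset, output_dir):
    """Run generation on the test set once and write one prediction per line."""
    results = trainer.predict(test_dataset, metric_key_prefix="test")
    if trainer.is_world_process_zero():
        preds = tokenizer.batch_decode(results.predictions, skip_special_tokens=True)
        with open(os.path.join(output_dir, "test_generations.txt"), "w") as f:
            f.write("\n".join(p.strip() for p in preds))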

If anyone figures out a way to get run_seq2seq.py to generate these predictions at each checkpoint, please share it. My understanding is that this was the previous behavior, and it was certainly useful for me…
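
In the meantime, a possible workaround (an untested sketch on my side; it assumes you are willing to edit the example script, that --predict_with_generate is on, and that test_dataset is the already-tokenized test set the script builds) is a TrainerCallback that runs generation whenever a checkpoint is saved and writes the decoded text into that checkpoint folder. The class name is mine, not something that exists in transformers:

import os

from transformers import TrainerCallback


class CheckpointGenerationsCallback(TrainerCallback):
    """Write generated test predictions into each checkpoint folder (hypothetical helper)."""

    def __init__(self, test_dataset, tokenizer):
        self.trainer = None            # set this after the Trainer has been built
        self.test_dataset = test_dataset
        self.tokenizer = tokenizer

    def on_save(self, args, state, control, **kwargs):
        if self.trainer is None:
            return
        # Generate with the current weights; needs predict_with_generate=True.
        results = self.trainer.predict(self.test_dataset)
        if not state.is_world_process_zero:
            return
        preds = self.tokenizer.batch_decode(results.predictions, skip_special_tokens=True)
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        os.makedirs(ckpt_dir, exist_ok=True)
        with open(os.path.join(ckpt_dir, "test_generations.txt"), "w") as f:
            f.write("\n".join(p.strip() for p in preds))


# In run_seq2seq.py, after the Seq2SeqTrainer is constructed:
# callback = CheckpointGenerationsCallback(test_dataset, tokenizer)
# trainer.add_callback(callback)
# callback.trainer = trainer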

I think it’s just a matter of filing an Issue requesting to restore the dropped functionality.

My feeling is that, since these are examples, there is no clear definition of what’s important, which is why this kind of thing happens from time to time and perfectly working functionality disappears.

Bottom line: if you see something else missing in this new incarnation of the script, please file an Issue showing how it worked before, so that it can either be restored or a new way found to accomplish the same thing.