How To Output "test_generations.txt" with run_seq2seq.py?

@stas or @sgugger would likely be able to answer this easily – thanks again for your comments on my previous query in December.

I was using the finetune_trainer.py script back in December, and found that running a script like this…

python3 -m torch.distributed.launch --nproc_per_node=8 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=1e-4  \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_test 100 \
    --fp16 \
    --sortish_sampler \
    --num_train_epochs 24 \
    --data_dir "/workspace/rabbit-py/corpii/short_name_sequential_source" \
    --model_name_or_path "google/pegasus-large" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --logging_steps 768 \
    --gradient_accumulation_steps 32 \
    --task 'summarization' \
    --max_target_length 12 \
    --val_max_target_length 12 \
    --test_max_target_length 12 \
    --overwrite_output_dir \
    --freeze_embeds \
    --adafactor \
    --run_name $RUN \
    "$@"

… would output checkpoint folders that looked like this:
[screenshot: output directory with checkpoint folders, including test_generations.txt]

The test_generations.txt file was exactly 100 lines long, so I assume it corresponded to the --n_test 100 argument, although I can’t be sure, as I struggled for a while to understand the difference between predict, eval, and test, and eventually gave up because the terminology was too confusing for me.

That said, the test_generations.txt file was generated and it was very useful.

I have now migrated to the new seq2seq script, run_seq2seq.py, from here: https://github.com/huggingface/transformers/tree/master/examples/seq2seq

I am now using it successfully, with a script like this:

PREFIX=$(basename $BASH_SOURCE) 

python3 /workspace/fw-py/transformers/examples/seq2seq/run_seq2seq.py \
    --model_name_or_path '/workspace/fw-py/models_foreign/pegasus_large' \
    --do_train \
    --do_eval \
    --do_predict \
    --logging_steps 768 \
    --evaluation_strategy steps \
    --num_train_epochs 10 \
    --task summarization \
    --train_file "/workspace/fw-py/corpii/${PREFIX}/train.json" \
    --validation_file "/workspace/fw-py/corpii/${PREFIX}/val.json" \
    --test_file "/workspace/fw-py/corpii/${PREFIX}/test.json" \
    --output_dir "/workspace/fw-py/predictions/${PREFIX}" \
    --overwrite_output_dir \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --predict_with_generate \
    --text_column "question" \
    --summary_column "known_answer"
    

Again, I don’t understand the difference between do_eval and do_predict, and I am not sure what predict_with_generate really means, and can’t find it documented clearly anywhere, so I am just using all of them.

This script is working, and is generating checkpoint folders that look like this:
[screenshot: output directory with checkpoint folders, but no test_generations.txt]

… which is a great start. However, I am missing the critical file I need in order to see what my model outputs: the test_generations.txt file.

Does anyone know if it is still possible to generate these test generations?

I did consult the --help output, and found …

--do_predict [DO_PREDICT] Whether to run predictions on the test set.
--predict_with_generate [PREDICT_WITH_GENERATE] Whether to use generate to calculate generative metrics (ROUGE, BLEU).

Which, although I don’t really understand what it means, does seem like something that could help create the test_generations.txt, but that does not seem to be happening in my case.
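
To make that concrete for myself, here is a toy sketch of how I currently picture it (this is just my guess, not the script’s actual code, and t5-small is only used because it is small enough to try quickly):

# Toy sketch: what evaluation/prediction sees without vs. with generation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("summarize: The cat sat on the mat all day long.", return_tensors="pt")
labels = tokenizer("The cat sat.", return_tensors="pt").input_ids

# Without --predict_with_generate, evaluation/prediction is a teacher-forced
# forward pass: you get a loss (and logits), but no free-running text.
loss = model(**inputs, labels=labels).loss

# With --predict_with_generate, Seq2SeqTrainer calls generate() instead, which
# is what produces decoded predictions that can be scored with ROUGE/BLEU and
# written out as text.
generated = model.generate(**inputs, max_length=12)

print("eval-style loss:", loss.item())
print("predict-style text:", tokenizer.decode(generated[0], skip_special_tokens=True))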

Also, FYI, I am running the script again, trying just 3 epochs, and here is the first output from the console, which I think should show all of my arguments:

02/22/2021 19:45:54 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
02/22/2021 19:45:54 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/workspace/fw-py/predictions/translated_one', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Feb22_19-45-54_43a398359e63', logging_first_step=False, logging_steps=768, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=768, dataloader_num_workers=0, past_index=-1, run_name='/workspace/fw-py/predictions/translated_one', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard', 'wandb'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, sortish_sampler=False, predict_with_generate=True)

Thanks!

OK, it looks like I can now answer my own question. I can’t find any documentation of this, but it seems that, with --do_predict and --predict_with_generate set, the script WILL generate a text file with predictions, but not for each checkpoint; instead, it happens once at the end, after all the epochs are complete.

The relevant code: transformers/run_seq2seq.py at commit f991daed185261085301d72c2cd634836df1044a in huggingface/transformers on GitHub
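
As far as I can read it, that code only runs once, after training finishes, and only when both --do_predict and --predict_with_generate are set. Paraphrased (my own variable names, not a verbatim copy), it does roughly this:

# Rough paraphrase of what run_seq2seq.py does for --do_predict at the end of
# training; trainer, tokenizer, test_dataset and training_args stand for the
# objects the script builds earlier.
import os

def write_test_generations(trainer, tokenizer, test_dataset, training_args):
    test_results = trainer.predict(test_dataset, metric_key_prefix="test")
    if trainer.is_world_process_zero() and training_args.predict_with_generate:
        preds = tokenizer.batch_decode(
            test_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        preds = [p.strip() for p in preds]
        path = os.path.join(training_args.output_dir, "test_generations.txt")
        with open(path, "w") as f:
            f.write("\n".join(preds))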

If anyone figures out a way to get run_seq2seq.py to generate these predictions at each checkpoint, please share; my understanding is that this was the previous behavior, and it was certainly useful for me…
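
In the meantime, the only workaround I can think of (untested, and definitely not built into run_seq2seq.py) would be to edit the script and register a custom TrainerCallback whose on_save hook runs generate() over the test inputs and writes a test_generations.txt into each checkpoint folder. A rough sketch; GenerateOnSaveCallback and test_texts are names I made up:

# Rough, untested sketch of a callback that writes generations per checkpoint.
import os
import torch
from transformers import TrainerCallback

class GenerateOnSaveCallback(TrainerCallback):
    def __init__(self, tokenizer, test_texts, max_length=12, batch_size=8):
        self.tokenizer = tokenizer
        self.test_texts = test_texts          # plain list of input strings
        self.max_length = max_length
        self.batch_size = batch_size

    def on_save(self, args, state, control, model=None, **kwargs):
        # The Trainer saves checkpoints as <output_dir>/checkpoint-<global_step>.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        preds = []
        model.eval()
        for i in range(0, len(self.test_texts), self.batch_size):
            batch = self.test_texts[i : i + self.batch_size]
            enc = self.tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
            enc = {k: v.to(model.device) for k, v in enc.items()}
            with torch.no_grad():
                out = model.generate(**enc, max_length=self.max_length)
            preds.extend(self.tokenizer.batch_decode(out, skip_special_tokens=True))
        with open(os.path.join(ckpt_dir, "test_generations.txt"), "w") as f:
            f.write("\n".join(preds))

# In run_seq2seq.py, after the Seq2SeqTrainer is created:
# trainer.add_callback(GenerateOnSaveCallback(tokenizer, test_texts))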

I think it’s just a matter of filing an Issue requesting to restore the dropped functionality.

My feeling is that, since these are examples, there is no clear definition of what’s important, and that’s why this kind of thing happens at times and perfectly working previous functionality disappears.

Bottom line: if you see something else that is missing in this new incarnation of the script, please file an Issue showing how it worked before, so that it can either be restored or a new way can be found to accomplish the same thing.

Hi @the-pale-king, did you manage to test successfully with --do_predict? I failed to test either CNN/DM or XSUM with --do_predict because of out-of-memory problems, even when I set the test batch size to 1. I am using the latest transformers 4.4.0-dev.

Is your GPU running out of memory? Try it with the --no_cuda flag to run without the GPU… For me, T5-small runs on an 8GB VRAM GPU and Pegasus-large on a 24GB VRAM GPU, but lots of other combinations of models and GPUs end with out-of-memory errors.
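
If it helps, the same ideas expressed directly in Python look roughly like this (just a sketch, assuming you build the training arguments yourself; eval_accumulation_steps is another knob worth trying, since it periodically moves the accumulated prediction tensors off the GPU):

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",
    do_predict=True,
    predict_with_generate=True,
    per_device_eval_batch_size=1,  # smallest possible prediction batches
    eval_accumulation_steps=1,     # stream accumulated prediction tensors to CPU each step
    no_cuda=True,                  # or run entirely on CPU if the GPU is simply too small
)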

@the-pale-king Yes, even with a 48GB GPU it still runs out of memory when I try to test the bart-large model. Anyway, thank you for your advice! I’m testing BART-large on the XSUM dataset on CPU now; it seems OK, but takes 3 hours.