Summarization: Is finetune_trainer.py accepting length arguments correctly?

Hi, thanks for this impressive library - I expect Huggingface to shortly take over the world. This is my first post.

I am using the most recent version of the library, cloned from master, as of 12-16-2020, specifically the code from here: https://github.com/huggingface/transformers/tree/master/examples/seq2seq.

It looks like @stas and @sgugger have most recently touched this code and might be best positioned to tell me what stupid mistake I am making.

I am trying to do some summarization with finetune_trainer.py.

As a proof of concept, I first started with the xsum dataset, running this shell script:

RUN="xsum-1500-train"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii_foreign/xsum" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --overwrite_output_dir \
    --run_name $RUN \
    "$@"

This works well, and in about two minutes (using 2x RTX 2070 Super), generates text in the test_generations.txt output file.

Here is the first line of the test_generations.txt output file:

the trio are up for best UK act and best album, as well as two nominations in the best song category. they have been nominated for their favourite album, Number One and Strong Again.

This is indeed a summary of the originating text, in the first line of test.source:

The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning 'Oh I think you're nominated'", said Dappy."And I was like 'Oh yeah, which one?' And now we've got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it's best of us to come down and mingle with everyone and say hello to the cameras. And now we find we've got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn't be too disappointed if they didn't win this time around."At the end of the day we're grateful to be where we are in our careers."If it don't happen then it don't happen - live to fight another day and keep on making albums and hits for the fans."Dappy also revealed they could be performing live several times on the night.The group will be doing Number One and also a possible rendition of the War Child single, I Got Soul.The charity song is a  re-working of The Killers' All These Things That I've Done and is set to feature artists like Chipmunk, Ironik and Pixie Lott.This year's Mobos will be held outside of London for the first time, in Glasgow on 30 September.N-Dubz said they were looking forward to performing for their Scottish fans and boasted about their recent shows north of the border."We just done Edinburgh the other day," said Dappy."We smashed up an N-Dubz show over there. We done Aberdeen about three or four months ago - we smashed up that show over there! Everywhere we go we smash it up!"

So far, so good.

Note that I am running only 1 epoch, and only using a small fraction of the data, because at this point I only want to do a proof of concept.

I now want to try a proof of concept on my own data.

My own training data uses shorter length text, both in input, and output.

For my training data I am trying to summarize text like:

HSH Solid Wood Bookshelf, 2 Tier Rustic Vintage Industrial Etagere Bookcase, Open Metal Farmhouse Book Shelf, Distressed Brown

…and end up with a summary like:

HSH Rustic Industrial

This example, as you can see, happens to fit the description of an “extractive” summarization, where all the text in the training target is included in the training source, but not all of my rows are like that – many of my rows might require something closer to an “abstractive” summarization. (Just FYI).

So, as a proof of concept, I now try a minimal modification of my script, just putting in my data directory, instead of xsum:

RUN="sn-vs-n-1-simple"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii/short_name_vs_name" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --overwrite_output_dir \
    --run_name $RUN \
    "$@"

I run this, and… given this first line of test.source

Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker Striped Climbing Outdoor Hooded Trench Coats S-Xxl

… the first line of test_generations.txt is:

Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker Striped Climbing Outdoor Hooded Trench Coats S-Xxl

… whereas the first line of test.target is:

Bloggerlove Hooded Trench Coat

… and the second line of test.source is:

Sony Portable Bluetooth Digital Turner AM/FM CD Player Mega Bass Reflex Stereo Sound System

… and the second line of test_generations.txt is:

Sony Portable Bluetooth Digital Turner AM/FM CD Player Mega Bass Reflex Stereo Sound System. Sony portable Bluetooth digital Turner MP/FM MP3 player Mega bass Reflex stereo sound system.

…whereas the second line of test.target is:
Sony Bluetooth

So clearly this is not working right!

At the most basic level, the summaries are too long… and actually, it seems that T5 is hallucinating “additional” text to add to my input text!

So, my first stop is to look at the console output:

12/18/2020 19:28:54 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 2, distributed training: False, 16-bits training: True
12/18/2020 19:28:54 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/workspace/rabbit-py/predictions/sn-vs-n-1-simple', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, model_parallel=False, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=5, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec18_19-28-54_94c29ef5e746', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='sn-vs-n-1-simple', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, label_smoothing=0.0, sortish_sampler=False, predict_with_generate=True, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:422] 2020-12-18 19:28:54,234 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /workspace/rabbit-py/models_foreign/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
[INFO|configuration_utils.py:458] 2020-12-18 19:28:54,236 >> Model config T5Config {
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|configuration_utils.py:422] 2020-12-18 19:28:54,466 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /workspace/rabbit-py/models_foreign/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
[INFO|configuration_utils.py:458] 2020-12-18 19:28:54,467 >> Model config T5Config {
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|tokenization_utils_base.py:1793] 2020-12-18 19:28:54,944 >> loading file https://huggingface.co/t5-small/resolve/main/spiece.model from cache at /workspace/rabbit-py/models_foreign/65fc04e21f45f61430aea0c4fedffac16a4d20d78b8e6601d8d996ebefefecd2.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d
[INFO|tokenization_utils_base.py:1793] 2020-12-18 19:28:54,944 >> loading file https://huggingface.co/t5-small/resolve/main/tokenizer.json from cache at /workspace/rabbit-py/models_foreign/06779097c78e12f47ef67ecb728810c2ae757ee0a9efe9390c6419783d99382d.8627f1bd5d270a9fd2e5a51c8bec3223896587cc3cfe13edeabb0992ab43c529
[INFO|modeling_utils.py:1014] 2020-12-18 19:28:55,263 >> loading weights file https://huggingface.co/t5-small/resolve/main/pytorch_model.bin from cache at /workspace/rabbit-py/models_foreign/fee5a3a0ae379232608b6eed45d2d7a0d2966b9683728838412caccc41b4b0ed.ddacdc89ec88482db20c676f0861a336f3d0409f94748c209847b49529d73885
[WARNING|modeling_utils.py:1122] 2020-12-18 19:28:56,647 >> Some weights of the model checkpoint at t5-small were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[INFO|modeling_utils.py:1139] 2020-12-18 19:28:56,647 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
12/18/2020 19:28:56 - INFO - utils -   using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
12/18/2020 19:28:58 - INFO - __main__ -   *** Train ***
[INFO|trainer.py:668] 2020-12-18 19:28:58,881 >> ***** Running training *****
[INFO|trainer.py:669] 2020-12-18 19:28:58,881 >>   Num examples = 1500
[INFO|trainer.py:670] 2020-12-18 19:28:58,881 >>   Num Epochs = 1
[INFO|trainer.py:671] 2020-12-18 19:28:58,881 >>   Instantaneous batch size per device = 5
[INFO|trainer.py:672] 2020-12-18 19:28:58,881 >>   Total train batch size (w. parallel, distributed & accumulation) = 10
[INFO|trainer.py:673] 2020-12-18 19:28:58,881 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:674] 2020-12-18 19:28:58,881 >>   Total optimization steps = 150
sn-vs-n-1-simple
[INFO|integrations.py:360] 2020-12-18 19:28:58,898 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: ghengis (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.12
wandb: Syncing run sn-vs-n-1-simple
wandb: ⭐ View project at https://wandb.ai/---/huggingface
wandb: 🚀 View run at https://wandb.ai/----/huggingface/runs/tu1t9h5g
wandb: Run data is saved locally in /workspace/rabbit-py/src/learning/wandb/run-20201218_192859-tu1t9h5g
wandb: Run `wandb offline` to turn off syncing.

  0%|          | 0/150 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
100%|██████████| 150/150 [00:42<00:00,  3.53it/s][INFO|trainer.py:821] 2020-12-18 19:29:43,464 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


                                                 {'epoch': 1.0}
100%|██████████| 150/150 [00:42<00:00,  3.49it/s]
[INFO|trainer.py:1183] 2020-12-18 19:29:43,467 >> Saving model checkpoint to /workspace/rabbit-py/predictions/sn-vs-n-1-simple
[INFO|configuration_utils.py:289] 2020-12-18 19:29:43,471 >> Configuration saved in /workspace/rabbit-py/predictions/sn-vs-n-1-simple/config.json
[INFO|modeling_utils.py:814] 2020-12-18 19:29:43,893 >> Model weights saved in /workspace/rabbit-py/predictions/sn-vs-n-1-simple/pytorch_model.bin
12/18/2020 19:29:43 - INFO - __main__ -   ***** train metrics *****
12/18/2020 19:29:43 - INFO - __main__ -     train_samples_per_second = 33.64
12/18/2020 19:29:43 - INFO - __main__ -     train_runtime = 44.5896
12/18/2020 19:29:43 - INFO - __main__ -     train_n_ojbs = 1500
12/18/2020 19:29:43 - INFO - __main__ -   *** Evaluate ***
[INFO|trainer.py:1369] 2020-12-18 19:29:43,950 >> ***** Running Evaluation *****
[INFO|trainer.py:1370] 2020-12-18 19:29:43,950 >>   Num examples = 300
[INFO|trainer.py:1371] 2020-12-18 19:29:43,951 >>   Batch size = 16
100%|██████████| 19/19 [00:20<00:00,  1.06s/it]
12/18/2020 19:30:05 - INFO - __main__ -   ***** val metrics *****
12/18/2020 19:30:05 - INFO - __main__ -     val_loss = 2.1546
12/18/2020 19:30:05 - INFO - __main__ -     val_rouge1 = 26.3976
12/18/2020 19:30:05 - INFO - __main__ -     val_rouge2 = 13.6039
12/18/2020 19:30:05 - INFO - __main__ -     val_rougeL = 26.1308
12/18/2020 19:30:05 - INFO - __main__ -     val_rougeLsum = 26.18
12/18/2020 19:30:05 - INFO - __main__ -     val_gen_len = 37.9
12/18/2020 19:30:05 - INFO - __main__ -     epoch = 1.0
12/18/2020 19:30:05 - INFO - __main__ -     val_samples_per_second = 14.091
12/18/2020 19:30:05 - INFO - __main__ -     val_runtime = 21.29
12/18/2020 19:30:05 - INFO - __main__ -     val_n_ojbs = 300
12/18/2020 19:30:05 - INFO - __main__ -   *** Predict ***
[INFO|trainer.py:1369] 2020-12-18 19:30:05,241 >> ***** Running Prediction *****
[INFO|trainer.py:1370] 2020-12-18 19:30:05,241 >>   Num examples = 100
[INFO|trainer.py:1371] 2020-12-18 19:30:05,241 >>   Batch size = 16
100%|██████████| 7/7 [00:05<00:00,  1.23it/s]12/18/2020 19:30:12 - INFO - __main__ -   ***** test metrics *****
12/18/2020 19:30:12 - INFO - __main__ -     test_loss = 2.2199
12/18/2020 19:30:12 - INFO - __main__ -     test_rouge1 = 27.7161
12/18/2020 19:30:12 - INFO - __main__ -     test_rouge2 = 13.4332
12/18/2020 19:30:12 - INFO - __main__ -     test_rougeL = 27.8038
12/18/2020 19:30:12 - INFO - __main__ -     test_rougeLsum = 27.7593
12/18/2020 19:30:12 - INFO - __main__ -     test_gen_len = 37.2
12/18/2020 19:30:12 - INFO - __main__ -     test_samples_per_second = 13.715
12/18/2020 19:30:12 - INFO - __main__ -     test_runtime = 7.2913
12/18/2020 19:30:12 - INFO - __main__ -     test_n_ojbs = 100
100%|██████████| 7/7 [00:06<00:00,  1.14it/s]

… And something that jumps out at me is this:

12/18/2020 19:28:56 - INFO - utils - using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}

It seems I am using task specific params, which are asking the model for a max_length of 200 tokens, right?
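
Digging into examples/seq2seq/utils.py, the helper that prints this line appears to be roughly the following (reconstructed from the copy I cloned, so treat it as a sketch rather than the authoritative source):

import logging

logger = logging.getLogger(__name__)

def use_task_specific_params(model, task):
    """Copy config.task_specific_params[task] (e.g. max_length=200) into model.config."""
    task_specific_params = model.config.task_specific_params
    if task_specific_params is not None:
        pars = task_specific_params.get(task, {})
        logger.info(f"using task specific params for {task}: {pars}")
        model.config.update(pars)

So the summarization block from the T5 config above (max_length=200, min_length=30, …) seems to get copied straight into model.config before training and generation.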

And then I see this Github comment:

As @aychang95 suggested you have to play around with the generate method arguments to see what works best for your example. Especially take a look at num_beams, max_length, min_length, early_stopping and length_penalty.

So my idea is: I should shorten this max_length. My target summaries never go over 50 tokens, so I should tell this to T5!

I reference the --help in finetune_trainer.py:

  --max_target_length MAX_TARGET_LENGTH
                        The maximum total sequence length for target text
                        after tokenization. Sequences longer than this will be
                        truncated, sequences shorter will be padded.
  --val_max_target_length VAL_MAX_TARGET_LENGTH
                        The maximum total sequence length for validation
                        target text after tokenization. Sequences longer than
                        this will be truncated, sequences shorter will be
                        padded.
  --test_max_target_length TEST_MAX_TARGET_LENGTH
                        The maximum total sequence length for test target text
                        after tokenization. Sequences longer than this will be
                        truncated, sequences shorter will be padded.

So it seems these arguments might do something. I can’t personally figure out why these values should ever be different; shouldn’t all of them match the maximum prediction length that I want? So I assume that is the case, for the time being.

So next, I run this script:

RUN="sn-vs-n-1-with-target-length"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii/short_name_vs_name" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --max_target_length 50 \
    --val_max_target_length 50 \
    --test_max_target_length 50 \
    --overwrite_output_dir \
    --run_name $RUN \
    "$@"


The only change here is the addition of these arguments:

  --max_target_length 50 \
  --val_max_target_length 50 \
  --test_max_target_length 50 \

… this script finishes… and then I find that the newly generated test_generations.txt is exactly the same!

So, as far as I can tell, these three added arguments have had no effect…!

and… the console output contains the same thing:

"task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },

...

12/18/2020 19:52:35 - INFO - utils -   using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
12/18/2020 19:52:38 - INFO - __main__ -   *** Train ***

So, my rough guess here is that somehow these max_target_length arguments are being overridden. I re-run the script, this time, removing this line:

--task 'summarization' \

… But once again, I get the same “too long” summary, and in the console I see the same “using task specific params…” message.

So my guess at this point is that there might be either a bug or something lacking in the documentation: some step that needs to be done to override the task specific params when using finetune_trainer.py?

Or (quite possibly) I’m doing something else wrong??

thanks!


From a quick look it appears that your diagnosis might be correct.

I can see how those length args are used to truncate the records in the datasets, but model.config remains unmodified, so when it comes to generate() it uses the task specific param defaults.

Most likely, after use_task_specific_params() is run, model.config needs to be overridden again with the user overrides.

So something like:

--- a/examples/seq2seq/finetune_trainer.py
+++ b/examples/seq2seq/finetune_trainer.py
@@ -205,6 +205,10 @@ def main():

     # use task specific params
     use_task_specific_params(model, data_args.task)
+    if model.config.max_length is not None and data_args.max_target_length is not None:
+        print(f"before {model.config.max_length}")
+        model.config.max_length = data_args.max_target_length
+        print(f"after {model.config.max_length}")

     # set num_beams for evaluation
     if data_args.eval_beams is None:

So, using your last command line (btw, I think it’s missing --task summarization), I get:

2020-12-18 14:30:47 | INFO | utils | using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
before 200
after 50

But then there are 3 of those *_max_target_length args to choose from.

But first, please test whether this change makes a difference.

I’m using the cnn_dm dataset from the README.md to test this:

./finetune_trainer.py --learning_rate=3e-5 --fp16 --do_train --do_eval --do_predict \
--evaluation_strategy steps --predict_with_generate --n_train 100 --n_val 100 --n_test 100 \
--num_train_epochs 1 --data_dir cnn_dm --model_name_or_path "t5-small" --output_dir output_dir \
--per_device_train_batch_size 5 --per_device_eval_batch_size 8 --max_target_length 50 \
--val_max_target_length 50 --test_max_target_length 50 --overwrite_output_dir --task summarization

Reading more, it appears that max_target_length and its 3 friends are there specifically to truncate the dataset records, but there are simply no user overrides for generate()’s arguments (edit: this is not so, see my later comment, which I found after closer inspection; the rest of this comment is still valid):

  • max_length ( int , optional, defaults to 20) – The maximum length of the sequence to be generated.
  • min_length ( int , optional, defaults to 10) – The minimum length of the sequence to be generated.

So most likely new flags need to be added that override these 2 model.config’s defaults.
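
For reference, kwargs passed explicitly to generate() take precedence over whatever use_task_specific_params() copied into model.config, so the effect of these two settings can be sanity-checked in a few lines outside the trainer. A standalone sketch (not the finetune_trainer.py code path; the example text is just the one from this thread):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(
    "summarize: Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker",
    return_tensors="pt",
)

# explicit arguments win over the max_length=200 / min_length=30 that
# were copied into model.config from task_specific_params
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=20,  # hard cap on generated tokens
    min_length=2,   # without this, the config's min_length=30 forces long outputs
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))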

I see that the script currently overrides these 4 model.config defaults:

  --encoder_layerdrop ENCODER_LAYERDROP
                        Encoder layer dropout probability. Goes into
                        model.config.
  --decoder_layerdrop DECODER_LAYERDROP
                        Decoder layer dropout probability. Goes into
                        model.config.
  --dropout DROPOUT     Dropout probability. Goes into model.config.
  --attention_dropout ATTENTION_DROPOUT
                        Attention dropout probability. Goes into model.config.

So perhaps what’s needed is generate_max_length and generate_min_length

or perhaps min_gen_length and max_gen_length to somewhat match:

--max_source_length MAX_SOURCE_LENGTH
--max_target_length MAX_TARGET_LENGTH

as these aren’t very clear on what they apply to.

Awesome, thanks for the reply. That was clearly my confusion: I didn’t realize that the existing max_* arguments were about dataset truncation. Now this makes more sense.
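
In other words, “dataset truncation” here just means the target text gets clipped at tokenization time. Roughly this kind of thing (an illustration with the tokenizer, not the exact Seq2SeqDataset code):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# --max_target_length & co. end up as the tokenizer's max_length for the labels,
# so an over-long target is simply cut off in the training data; nothing here
# tells generate() how long the predictions are allowed to be.
labels = tokenizer(
    "Bloggerlove Hooded Trench Coat",
    max_length=50,
    truncation=True,
)
print(len(labels["input_ids"]))  # <= 50, independent of what generate() later produces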

And looking some more, it looks like val_max_target_length is used in generate() and overrides model.config.max_length, as you can see here:

So actually there is a working solution now that we know which of the 4 args is used to override max_length.

I double checked that it is so with:

diff --git a/examples/seq2seq/seq2seq_trainer.py b/examples/seq2seq/seq2seq_trainer.py
index 32a96555..7d8f4741 100644
--- a/examples/seq2seq/seq2seq_trainer.py
+++ b/examples/seq2seq/seq2seq_trainer.py
@@ -216,6 +216,10 @@ class Seq2SeqTrainer(Trainer):
             "num_beams": self.data_args.eval_beams if self.data_args is not None else self.config.num_beams,
         }

+        logger.info(f"***** generate args *****")
+        for k, v in sorted(gen_kwargs.items()):
+            logger.info(f"  {k} = {v}")
+
         if self.args.predict_with_generate and not self.args.prediction_loss_only:
             generated_tokens = self.model.generate(
                 inputs["input_ids"],

So getting:

2020-12-18 16:21:38 | INFO | seq2seq_trainer | ***** generate args *****
2020-12-18 16:21:38 | INFO | seq2seq_trainer |   max_length = 50
2020-12-18 16:21:38 | INFO | seq2seq_trainer |   num_beams = 4

So overriding is happening.

But why it uses only self.data_args.val_max_target_length, I don’t know.
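
For context, the gen_kwargs dict built just above that diff hunk is hard-wired to the validation arg; from what I can see it is roughly:

gen_kwargs = {
    "max_length": self.data_args.val_max_target_length if self.data_args is not None else self.config.max_length,
    "num_beams": self.data_args.eval_beams if self.data_args is not None else self.config.num_beams,
}

i.e. prediction_step() reads val_max_target_length for generation whether it is evaluating or predicting, which is why only that one of the three *_max_target_length args has any effect on the generated length.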

So 2 possible things to do here:

  1. either add explicit --min_gen_length and --max_gen_length args and pass those into generate (a rough sketch of what that could look like is just below this list), or at the very least document that --val_max_target_length has a double usage: one for validation dataset truncation, and a secondary use as generate’s max_length override.
  2. Perhaps that comment about use task specific params should be amended to say that further overrides may happen, since the info logger doesn’t report that model.config.max_length is effectively overridden by self.data_args.val_max_target_length, and thus it is confusing to the user.
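
To make item 1 concrete, such flags would just be two more dataclass fields on the script’s data arguments class (DataTrainingArguments, if I recall the name correctly) that then get passed into gen_kwargs. A purely hypothetical sketch; the names max_gen_length / min_gen_length are mine, not something the script or the PR actually defines:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataTrainingArguments:
    # ...existing fields such as max_source_length, max_target_length, eval_beams...
    max_gen_length: Optional[int] = field(
        default=None,
        metadata={"help": "max_length passed to generate() during eval/predict (goes into gen_kwargs)."},
    )
    min_gen_length: Optional[int] = field(
        default=None,
        metadata={"help": "min_length passed to generate() during eval/predict (goes into gen_kwargs)."},
    )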

I submitted a PR that addresses these 2 items above.

So this flurry of comments cleared up which command-line arg to use to override max_length, but I doubt it made any difference to your problem.

If the problem is still unresolved, please help us reproduce it. Ideally use the existing summarization datasets that we use for testing, as explained here:

or, if that doesn’t work, please make a small sample that reproduces the problem with your data, plus copy-and-paste instructions to get it and deploy it. Thank you!

Thanks for your detailed comments.

I believe I can provide you with a command to reproduce this issue. (Unless I am somehow confused.)

According to my reading of your comments, this command, using the xsum dataset, should (but does not) generate summaries of fewer than 50 tokens:

RUN="xsum-1500-train-try-max-len"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii_foreign/xsum" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --overwrite_output_dir \
    --run_name $RUN \
    --val_max_target_length 50 \
    --test_max_target_length 50 \
    --max_target_length 50 \
    "$@"

When I run this, I get summaries like

the trio are up for best UK act and best album, as well as two nominations in the best song category. they have been nominated for their favourite album, Number One and Strong Again.

Which I think are too long.

I also see these lines in the console output:

INFO - utils - using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}

"task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },

thanks again

As I addressed in the comments above, the log is confusing, and there is a PR to fix that.

I checked that the code works correctly, though. Observe:

rm -r output_dir; USE_TF=0 PYTHONPATH="../../src" ./finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 150 \
    --n_val 30 \
    --n_test 10 \
    --num_train_epochs 1 \
    --data_dir "xsum" \
    --model_name_or_path "t5-small" \
    --output_dir output_dir \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --val_max_target_length 50 \
    --test_max_target_length 50 \
    --max_target_length 50

head -1 output_dir/test_generations.txt | wc -w
41

Now let’s set the limit to 5 tokens:

rm -r output_dir; USE_TF=0 PYTHONPATH="../../src" ./finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 150 \
    --n_val 30 \
    --n_test 10 \
    --num_train_epochs 1 \
    --data_dir "xsum" \
    --model_name_or_path "t5-small" \
    --output_dir output_dir \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --val_max_target_length 5 \
    --test_max_target_length 5 \
    --max_target_length 5

head -1 output_dir/test_generations.txt | wc -w
4

As you can see, --val_max_target_length directly affects the length of the prediction sequence.

  • with 50 tokens max you got a 41 word summary
  • with 5 tokens max you got a 4 word summary

This makes sense to me: when very common words are used, they are already in the vocabulary as-is (or as big pieces of words), so the token count ends up close to the word count.

You can of course go into the code where the tokens are detokenized and print the length of the generated tokens; you will most likely see that it matches max_length.
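
Building on the logging diff above, the quickest place for that check is right after the generate() call in Seq2SeqTrainer.prediction_step, e.g. (simplified, reusing the names visible in that diff):

generated_tokens = self.model.generate(inputs["input_ids"], **gen_kwargs)
# length in tokens (ids), capped by gen_kwargs["max_length"]; compare it with
# the word count that `wc -w` reports on the decoded text
logger.info(f"generated length in tokens = {generated_tokens.shape[-1]}")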

Please let me know whether these comparison runs address your question.

Great, thanks so much for that. The real issue here was my (stupid) assumption about what a “token” is. For some reason I got it into my head that tokens meant characters, which, of course, they do not. They correspond more closely to words.

Yes, the console output might have been confusing me also.

But this is why I posted to the Beginner channel, obviously.

For anyone reading this thread in the future… use --val_max_target_length and specify the number of tokens you want (think of them as words, not characters!)
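
If you want to sanity-check this for your own data, you can count tokens directly with the tokenizer (something I should have done first):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

for text in ["Bloggerlove Hooded Trench Coat", "Sony Bluetooth"]:
    ids = tokenizer(text).input_ids
    # common English words are usually a single token each; unusual brand names
    # may split into a few sub-word pieces, but tokens are nowhere near characters
    print(f"{text!r}: {len(ids)} tokens (includes the end-of-sequence token)")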

thanks

And yes, of course, my own dataset works well, now that I have chosen the right number of tokens.


@the-pale-king Consider sharing the resulting model if you can!