Summarization: Is finetune_trainer.py accepting length arguments correctly?

Hi, thanks for this impressive library - I expect Huggingface to shortly take over the world. This is my first post.

I am using the most recent version of the library, cloned from master, as of 12-16-2020, specifically the code from here: https://github.com/huggingface/transformers/tree/master/examples/seq2seq.

It looks like @stas and @sgugger have most recently touched this code and might be best positioned to tell me what stupid mistake I am making.

I am trying to do some summarization with finetune_trainer.py.

As a proof of concept, I first started with the xsum dataset, running this shell script:

RUN="xsum-1500-train"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii_foreign/xsum" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --overwrite_output_dir \
    --run_name $RUN \
    "$@"

This works well, and in about two minutes (using 2x RTX 2070 Super), generates text in the test_generations.txt output file.

Here is the first line of the test_generations.txt output file:

the trio are up for best UK act and best album, as well as two nominations in the best song category. they have been nominated for their favourite album, Number One and Strong Again.

This is indeed a summary of the originating text, in the first line of test.source:

The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning 'Oh I think you're nominated'", said Dappy."And I was like 'Oh yeah, which one?' And now we've got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it's best of us to come down and mingle with everyone and say hello to the cameras. And now we find we've got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn't be too disappointed if they didn't win this time around."At the end of the day we're grateful to be where we are in our careers."If it don't happen then it don't happen - live to fight another day and keep on making albums and hits for the fans."Dappy also revealed they could be performing live several times on the night.The group will be doing Number One and also a possible rendition of the War Child single, I Got Soul.The charity song is a  re-working of The Killers' All These Things That I've Done and is set to feature artists like Chipmunk, Ironik and Pixie Lott.This year's Mobos will be held outside of London for the first time, in Glasgow on 30 September.N-Dubz said they were looking forward to performing for their Scottish fans and boasted about their recent shows north of the border."We just done Edinburgh the other day," said Dappy."We smashed up an N-Dubz show over there. We done Aberdeen about three or four months ago - we smashed up that show over there! Everywhere we go we smash it up!"

So far, so good.

Note that I am running only 1 epoch, and only using a small fraction of the data, because at this point I only want to do a proof of concept.

I now want to try a proof of concept on my own data.

My own training data uses shorter length text, both in input, and output.

For my training data I am trying to summarize text like:

HSH Solid Wood Bookshelf, 2 Tier Rustic Vintage Industrial Etagere Bookcase, Open Metal Farmhouse Book Shelf, Distressed Brown

…and end up with a summary like:

HSH Rustic Industrial

This example, as you can see, happens to fit the description of an “extractive” summarization, where all the text in the training target is included in the training source, but not all of my rows are like that – many of my rows might require something closer to an “abstractive” summarization. (Just FYI).

So, as a proof of concept, I now try a minimal modification of my script, just putting in my data directory, instead of xsum:

RUN="sn-vs-n-1-simple"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii/short_name_vs_name" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --overwrite_output_dir \
    --run_name $RUN \
    "$@"

I run this, and… given this first line of test.source

Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker Striped Climbing Outdoor Hooded Trench Coats S-Xxl

… the first line of test_generations.txt is:

Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker Striped Climbing Outdoor Hooded Trench Coats S-Xxl

… whereas the first line of test.target is:

Bloggerlove Hooded Trench Coat

… and the second line of test.source is:

Sony Portable Bluetooth Digital Turner AM/FM CD Player Mega Bass Reflex Stereo Sound System

… and the second line of test_generations.txt is:

Sony Portable Bluetooth Digital Turner AM/FM CD Player Mega Bass Reflex Stereo Sound System. Sony portable Bluetooth digital Turner MP/FM MP3 player Mega bass Reflex stereo sound system.

…whereas the second line of test.target is:
Sony Bluetooth

So clearly this is not working right!

At the most basic level, the summaries are too long… and actually, it seems that T5 is hallucinating “additional” text to add to my input text!

So, my first stop is to look at the console output:

12/18/2020 19:28:54 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 2, distributed training: False, 16-bits training: True
12/18/2020 19:28:54 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/workspace/rabbit-py/predictions/sn-vs-n-1-simple', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, model_parallel=False, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=5, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec18_19-28-54_94c29ef5e746', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='sn-vs-n-1-simple', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, label_smoothing=0.0, sortish_sampler=False, predict_with_generate=True, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:422] 2020-12-18 19:28:54,234 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /workspace/rabbit-py/models_foreign/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
[INFO|configuration_utils.py:458] 2020-12-18 19:28:54,236 >> Model config T5Config {
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|configuration_utils.py:422] 2020-12-18 19:28:54,466 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /workspace/rabbit-py/models_foreign/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
[INFO|configuration_utils.py:458] 2020-12-18 19:28:54,467 >> Model config T5Config {
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "use_cache": true,
  "vocab_size": 32128
}

[INFO|tokenization_utils_base.py:1793] 2020-12-18 19:28:54,944 >> loading file https://huggingface.co/t5-small/resolve/main/spiece.model from cache at /workspace/rabbit-py/models_foreign/65fc04e21f45f61430aea0c4fedffac16a4d20d78b8e6601d8d996ebefefecd2.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d
[INFO|tokenization_utils_base.py:1793] 2020-12-18 19:28:54,944 >> loading file https://huggingface.co/t5-small/resolve/main/tokenizer.json from cache at /workspace/rabbit-py/models_foreign/06779097c78e12f47ef67ecb728810c2ae757ee0a9efe9390c6419783d99382d.8627f1bd5d270a9fd2e5a51c8bec3223896587cc3cfe13edeabb0992ab43c529
[INFO|modeling_utils.py:1014] 2020-12-18 19:28:55,263 >> loading weights file https://huggingface.co/t5-small/resolve/main/pytorch_model.bin from cache at /workspace/rabbit-py/models_foreign/fee5a3a0ae379232608b6eed45d2d7a0d2966b9683728838412caccc41b4b0ed.ddacdc89ec88482db20c676f0861a336f3d0409f94748c209847b49529d73885
[WARNING|modeling_utils.py:1122] 2020-12-18 19:28:56,647 >> Some weights of the model checkpoint at t5-small were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[INFO|modeling_utils.py:1139] 2020-12-18 19:28:56,647 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
12/18/2020 19:28:56 - INFO - utils -   using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
12/18/2020 19:28:58 - INFO - __main__ -   *** Train ***
[INFO|trainer.py:668] 2020-12-18 19:28:58,881 >> ***** Running training *****
[INFO|trainer.py:669] 2020-12-18 19:28:58,881 >>   Num examples = 1500
[INFO|trainer.py:670] 2020-12-18 19:28:58,881 >>   Num Epochs = 1
[INFO|trainer.py:671] 2020-12-18 19:28:58,881 >>   Instantaneous batch size per device = 5
[INFO|trainer.py:672] 2020-12-18 19:28:58,881 >>   Total train batch size (w. parallel, distributed & accumulation) = 10
[INFO|trainer.py:673] 2020-12-18 19:28:58,881 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:674] 2020-12-18 19:28:58,881 >>   Total optimization steps = 150
sn-vs-n-1-simple
[INFO|integrations.py:360] 2020-12-18 19:28:58,898 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: ghengis (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.12
wandb: Syncing run sn-vs-n-1-simple
wandb: ⭐ View project at https://wandb.ai/---/huggingface
wandb: 🚀 View run at https://wandb.ai/----/huggingface/runs/tu1t9h5g
wandb: Run data is saved locally in /workspace/rabbit-py/src/learning/wandb/run-20201218_192859-tu1t9h5g
wandb: Run `wandb offline` to turn off syncing.

  0%|          | 0/150 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
100%|██████████| 150/150 [00:42<00:00,  3.53it/s][INFO|trainer.py:821] 2020-12-18 19:29:43,464 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


                                                 {'epoch': 1.0}
100%|██████████| 150/150 [00:42<00:00,  3.49it/s]
[INFO|trainer.py:1183] 2020-12-18 19:29:43,467 >> Saving model checkpoint to /workspace/rabbit-py/predictions/sn-vs-n-1-simple
[INFO|configuration_utils.py:289] 2020-12-18 19:29:43,471 >> Configuration saved in /workspace/rabbit-py/predictions/sn-vs-n-1-simple/config.json
[INFO|modeling_utils.py:814] 2020-12-18 19:29:43,893 >> Model weights saved in /workspace/rabbit-py/predictions/sn-vs-n-1-simple/pytorch_model.bin
12/18/2020 19:29:43 - INFO - __main__ -   ***** train metrics *****
12/18/2020 19:29:43 - INFO - __main__ -     train_samples_per_second = 33.64
12/18/2020 19:29:43 - INFO - __main__ -     train_runtime = 44.5896
12/18/2020 19:29:43 - INFO - __main__ -     train_n_ojbs = 1500
12/18/2020 19:29:43 - INFO - __main__ -   *** Evaluate ***
[INFO|trainer.py:1369] 2020-12-18 19:29:43,950 >> ***** Running Evaluation *****
[INFO|trainer.py:1370] 2020-12-18 19:29:43,950 >>   Num examples = 300
[INFO|trainer.py:1371] 2020-12-18 19:29:43,951 >>   Batch size = 16
100%|██████████| 19/19 [00:20<00:00,  1.06s/it]
12/18/2020 19:30:05 - INFO - __main__ -   ***** val metrics *****
12/18/2020 19:30:05 - INFO - __main__ -     val_loss = 2.1546
12/18/2020 19:30:05 - INFO - __main__ -     val_rouge1 = 26.3976
12/18/2020 19:30:05 - INFO - __main__ -     val_rouge2 = 13.6039
12/18/2020 19:30:05 - INFO - __main__ -     val_rougeL = 26.1308
12/18/2020 19:30:05 - INFO - __main__ -     val_rougeLsum = 26.18
12/18/2020 19:30:05 - INFO - __main__ -     val_gen_len = 37.9
12/18/2020 19:30:05 - INFO - __main__ -     epoch = 1.0
12/18/2020 19:30:05 - INFO - __main__ -     val_samples_per_second = 14.091
12/18/2020 19:30:05 - INFO - __main__ -     val_runtime = 21.29
12/18/2020 19:30:05 - INFO - __main__ -     val_n_ojbs = 300
12/18/2020 19:30:05 - INFO - __main__ -   *** Predict ***
[INFO|trainer.py:1369] 2020-12-18 19:30:05,241 >> ***** Running Prediction *****
[INFO|trainer.py:1370] 2020-12-18 19:30:05,241 >>   Num examples = 100
[INFO|trainer.py:1371] 2020-12-18 19:30:05,241 >>   Batch size = 16
100%|██████████| 7/7 [00:05<00:00,  1.23it/s]12/18/2020 19:30:12 - INFO - __main__ -   ***** test metrics *****
12/18/2020 19:30:12 - INFO - __main__ -     test_loss = 2.2199
12/18/2020 19:30:12 - INFO - __main__ -     test_rouge1 = 27.7161
12/18/2020 19:30:12 - INFO - __main__ -     test_rouge2 = 13.4332
12/18/2020 19:30:12 - INFO - __main__ -     test_rougeL = 27.8038
12/18/2020 19:30:12 - INFO - __main__ -     test_rougeLsum = 27.7593
12/18/2020 19:30:12 - INFO - __main__ -     test_gen_len = 37.2
12/18/2020 19:30:12 - INFO - __main__ -     test_samples_per_second = 13.715
12/18/2020 19:30:12 - INFO - __main__ -     test_runtime = 7.2913
12/18/2020 19:30:12 - INFO - __main__ -     test_n_ojbs = 100
100%|██████████| 7/7 [00:06<00:00,  1.14it/s]

… And something that jumps out at me is this:

12/18/2020 19:28:56 - INFO - utils - using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}

It seems I am using task specific params, which are asking the model for a max_length of 200 tokens, right?
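
Digging into examples/seq2seq/utils.py, the helper that prints this line appears to be roughly the following (reconstructed from the copy I cloned, so treat it as a sketch rather than the authoritative source):

import logging

logger = logging.getLogger(__name__)

def use_task_specific_params(model, task):
    """Copy config.task_specific_params[task] (e.g. max_length=200) into model.config."""
    task_specific_params = model.config.task_specific_params
    if task_specific_params is not None:
        pars = task_specific_params.get(task, {})
        logger.info(f"using task specific params for {task}: {pars}")
        model.config.update(pars)

So the summarization block from the T5 config above (max_length=200, min_length=30, …) seems to get copied straight into model.config before training and generation.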

And then I see this Github comment:

As @aychang95 suggested you have to play around with the generate method arguments to see what works best for your example. Especially take a look at num_beams, max_length, min_length, early_stopping and length_penalty.

So my idea is: I should shorten this max_length. My target summaries never go over 50 tokens, so I should tell this to T5!

I reference the --help in finetune_trainer.py:

  --max_target_length MAX_TARGET_LENGTH
                        The maximum total sequence length for target text
                        after tokenization. Sequences longer than this will be
                        truncated, sequences shorter will be padded.
  --val_max_target_length VAL_MAX_TARGET_LENGTH
                        The maximum total sequence length for validation
                        target text after tokenization. Sequences longer than
                        this will be truncated, sequences shorter will be
                        padded.
  --test_max_target_length TEST_MAX_TARGET_LENGTH
                        The maximum total sequence length for test target text
                        after tokenization. Sequences longer than this will be
                        truncated, sequences shorter will be padded.

So it seems these arguments might do something. I can’t personally figure out why these values should ever be different; shouldn’t all of them match the maximum prediction length that I want? So I assume that is the case, for the time being.

So next, I run this script:

RUN="sn-vs-n-1-with-target-length"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii/short_name_vs_name" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --max_target_length 50 \
    --val_max_target_length 50 \
    --test_max_target_length 50 \
    --overwrite_output_dir \
    --run_name $RUN \
    "$@"


The only change here is the addition of these arguments:

  --max_target_length 50 \
  --val_max_target_length 50 \
  --test_max_target_length 50 \

… this script finishes… and then I find that the newly generated test_generations.txt is exactly the same!

So, as far as I can tell, these three added arguments have had no effect…!

and… the console output contains the same thing:

"task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },

...

12/18/2020 19:52:35 - INFO - utils -   using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
12/18/2020 19:52:38 - INFO - __main__ -   *** Train ***

So, my rough guess here is that somehow these max_target_length arguments are being overridden. I re-run the script, this time, removing this line:

--task 'summarization' \

… But once again, I get the same “too long” summary, and in the console I see the same “using task specific params…” message.

So my guess at this point is that there might be either a bug or something lacking in the documentation: some step that needs to be done to override the task specific params when using finetune_trainer.py?

Or (quite possibly) I’m doing something else wrong??

thanks!


From a quick look it appears that your diagnosis might be correct.

I can see how those length args are used to truncate the records in the datasets, but model.config remains unmodified, so when it comes to generate() it uses the task specific param defaults.

Most likely, after use_task_specific_params() is run, model.config needs to be overridden again with the user overrides.

So something like:

--- a/examples/seq2seq/finetune_trainer.py
+++ b/examples/seq2seq/finetune_trainer.py
@@ -205,6 +205,10 @@ def main():

     # use task specific params
     use_task_specific_params(model, data_args.task)
+    if model.config.max_length is not None and data_args.max_target_length is not None:
+        print(f"before {model.config.max_length}")
+        model.config.max_length = data_args.max_target_length
+        print(f"after {model.config.max_length}")

     # set num_beams for evaluation
     if data_args.eval_beams is None:

So, using your last command line (btw, I think it’s missing --task summarization), I get:

2020-12-18 14:30:47 | INFO | utils | using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
before 200
after 50

But then there are 3 of those *_max_target_length args to choose from.

But first, please test whether this change makes a difference.

I’m using the cnn_dm dataset from the README.md to test this:

./finetune_trainer.py --learning_rate=3e-5 --fp16 --do_train --do_eval --do_predict \
--evaluation_strategy steps --predict_with_generate --n_train 100 --n_val 100 --n_test 100 \
--num_train_epochs 1 --data_dir cnn_dm --model_name_or_path "t5-small" --output_dir output_dir \
--per_device_train_batch_size 5 --per_device_eval_batch_size 8 --max_target_length 50 \
--val_max_target_length 50 --test_max_target_length 50 --overwrite_output_dir --task summarization

Reading more, it appears that max_target_length and its 3 friends are there specifically to truncate the dataset records, but there are simply no user overrides for generate()’s arguments (edit: this is not so, see my later comment, which I found after closer inspection; the rest of this comment is still valid):

  • max_length ( int , optional, defaults to 20) – The maximum length of the sequence to be generated.
  • min_length ( int , optional, defaults to 10) – The minimum length of the sequence to be generated.

So most likely new flags need to be added that override these 2 model.config’s defaults.
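
For reference, kwargs passed explicitly to generate() take precedence over whatever use_task_specific_params() copied into model.config, so the effect of these two settings can be sanity-checked in a few lines outside the trainer. A standalone sketch (not the finetune_trainer.py code path; the example text is just the one from this thread):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(
    "summarize: Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker",
    return_tensors="pt",
)

# explicit arguments win over the max_length=200 / min_length=30 that
# were copied into model.config from task_specific_params
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=20,  # hard cap on generated tokens
    min_length=2,   # without this, the config's min_length=30 forces long outputs
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))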

I see that the script currently overrides these 4 model.config defaults:

  --encoder_layerdrop ENCODER_LAYERDROP
                        Encoder layer dropout probability. Goes into
                        model.config.
  --decoder_layerdrop DECODER_LAYERDROP
                        Decoder layer dropout probability. Goes into
                        model.config.
  --dropout DROPOUT     Dropout probability. Goes into model.config.
  --attention_dropout ATTENTION_DROPOUT
                        Attention dropout probability. Goes into model.config.

So perhaps what’s needed is generate_max_length and generate_min_length

or perhaps min_gen_length and max_gen_length to somewhat match:

--max_source_length MAX_SOURCE_LENGTH
--max_target_length MAX_TARGET_LENGTH

as these aren’t very clear on what they apply to.

Awesome, thanks for the reply. That was clearly my confusion: I didn’t realize that the existing max_* arguments were about dataset truncation. Now this makes more sense.
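
In other words, “dataset truncation” here just means the target text gets clipped at tokenization time. Roughly this kind of thing (an illustration with the tokenizer, not the exact Seq2SeqDataset code):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# --max_target_length & co. end up as the tokenizer's max_length for the labels,
# so an over-long target is simply cut off in the training data; nothing here
# tells generate() how long the predictions are allowed to be.
labels = tokenizer(
    "Bloggerlove Hooded Trench Coat",
    max_length=50,
    truncation=True,
)
print(len(labels["input_ids"]))  # <= 50, independent of what generate() later produces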

And looking some more, it looks like val_max_target_length is used in generate() and overrides model.config.max_length, as you can see here:

So actually there is a working solution now that we know which of the 4 args is used to override max_length.

I double checked that it is so with:

diff --git a/examples/seq2seq/seq2seq_trainer.py b/examples/seq2seq/seq2seq_trainer.py
index 32a96555..7d8f4741 100644
--- a/examples/seq2seq/seq2seq_trainer.py
+++ b/examples/seq2seq/seq2seq_trainer.py
@@ -216,6 +216,10 @@ class Seq2SeqTrainer(Trainer):
             "num_beams": self.data_args.eval_beams if self.data_args is not None else self.config.num_beams,
         }

+        logger.info(f"***** generate args *****")
+        for k, v in sorted(gen_kwargs.items()):
+            logger.info(f"  {k} = {v}")
+
         if self.args.predict_with_generate and not self.args.prediction_loss_only:
             generated_tokens = self.model.generate(
                 inputs["input_ids"],

So getting:

2020-12-18 16:21:38 | INFO | seq2seq_trainer | ***** generate args *****
2020-12-18 16:21:38 | INFO | seq2seq_trainer |   max_length = 50
2020-12-18 16:21:38 | INFO | seq2seq_trainer |   num_beams = 4

So overriding is happening.

But why it uses only self.data_args.val_max_target_length, I don’t know.
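
For context, the gen_kwargs dict built just above that diff hunk is hard-wired to the validation arg; from what I can see it is roughly:

gen_kwargs = {
    "max_length": self.data_args.val_max_target_length if self.data_args is not None else self.config.max_length,
    "num_beams": self.data_args.eval_beams if self.data_args is not None else self.config.num_beams,
}

i.e. prediction_step() reads val_max_target_length for generation whether it is evaluating or predicting, which is why only that one of the three *_max_target_length args has any effect on the generated length.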

So 2 possible things to do here:

  1. either add explicit --min_gen_length and --max_gen_length args and pass those into generate (a rough sketch of what that could look like is just below this list), or at the very least document that --val_max_target_length has a double usage: one for validation dataset truncation, and a secondary use as generate’s max_length override.
  2. Perhaps that comment about use task specific params should be amended to say that further overrides may happen, since the info logger doesn’t report that model.config.max_length is effectively overridden by self.data_args.val_max_target_length, and thus it is confusing to the user.
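
To make item 1 concrete, such flags would just be two more dataclass fields on the script’s data arguments class (DataTrainingArguments, if I recall the name correctly) that then get passed into gen_kwargs. A purely hypothetical sketch; the names max_gen_length / min_gen_length are mine, not something the script or the PR actually defines:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataTrainingArguments:
    # ...existing fields such as max_source_length, max_target_length, eval_beams...
    max_gen_length: Optional[int] = field(
        default=None,
        metadata={"help": "max_length passed to generate() during eval/predict (goes into gen_kwargs)."},
    )
    min_gen_length: Optional[int] = field(
        default=None,
        metadata={"help": "min_length passed to generate() during eval/predict (goes into gen_kwargs)."},
    )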

I submitted a PR that addresses these 2 items above.

So this flurry of comments cleared up which command-line arg to use to override max_length, but I doubt it made any difference to your problem.

If the problem is still unresolved, please help us reproduce it. Ideally use the existing summarization datasets that we use for testing, as explained here:

or, if that doesn’t work, please make a small sample that reproduces the problem with your data, plus copy-and-paste instructions to get it and deploy it. Thank you!

Thanks for your detailed comments.

I believe I can provide you with a command to reproduce this issue. (Unless I am somehow confused.)

According to my reading of your comments, this command, using the xsum dataset, should (but does not) generate summaries of fewer than 50 tokens:

RUN="xsum-1500-train-try-max-len"

python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 1500 \
    --n_val 300 \
    --n_test 100 \
    --num_train_epochs 1 \
    --data_dir "/workspace/rabbit-py/corpii_foreign/xsum" \
    --model_name_or_path "t5-small" \
    --output_dir "/workspace/rabbit-py/predictions/$RUN" \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --overwrite_output_dir \
    --run_name $RUN \
    --val_max_target_length 50 \
    --test_max_target_length 50 \
    --max_target_length 50 \
    "$@"

When I run this, I get summaries like

the trio are up for best UK act and best album, as well as two nominations in the best song category. they have been nominated for their favourite album, Number One and Strong Again.

Which I think are too long.

I also see these lines in the console output:

INFO - utils - using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}

"task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },

thanks again

As I addressed in the comments above, the log is confusing, and there is a PR to fix that.

I checked that the code works correctly, though. Observe:

rm -r output_dir; USE_TF=0 PYTHONPATH="../../src" ./finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 150 \
    --n_val 30 \
    --n_test 10 \
    --num_train_epochs 1 \
    --data_dir "xsum" \
    --model_name_or_path "t5-small" \
    --output_dir output_dir \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --val_max_target_length 50 \
    --test_max_target_length 50 \
    --max_target_length 50

head -1 output_dir/test_generations.txt | wc -w
41

Now let’s set the limit to 5 tokens:

rm -r output_dir; USE_TF=0 PYTHONPATH="../../src" ./finetune_trainer.py \
    --learning_rate=3e-5 \
    --fp16 \
    --do_train --do_eval --do_predict \
    --evaluation_strategy steps \
    --predict_with_generate \
    --n_train 150 \
    --n_val 30 \
    --n_test 10 \
    --num_train_epochs 1 \
    --data_dir "xsum" \
    --model_name_or_path "t5-small" \
    --output_dir output_dir \
    --per_device_train_batch_size 5 \
    --per_device_eval_batch_size 8 \
    --task 'summarization' \
    --val_max_target_length 5 \
    --test_max_target_length 5 \
    --max_target_length 5

head -1 output_dir/test_generations.txt | wc -w
4

As you can see, --val_max_target_length directly affects the length of the prediction sequence.

  • with 50 tokens max you got a 41 word summary
  • with 5 tokens max you got a 4 word summary

This makes sense to me: when very common words are used, they are already in the vocabulary as-is (or as big pieces of words), so the token count ends up close to the word count.

You can of course go into the code where the tokens are detokenized and print the length of the generated tokens; you will most likely see that it matches max_length.
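
Building on the logging diff above, the quickest place for that check is right after the generate() call in Seq2SeqTrainer.prediction_step, e.g. (simplified, reusing the names visible in that diff):

generated_tokens = self.model.generate(inputs["input_ids"], **gen_kwargs)
# length in tokens (ids), capped by gen_kwargs["max_length"]; compare it with
# the word count that `wc -w` reports on the decoded text
logger.info(f"generated length in tokens = {generated_tokens.shape[-1]}")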

Please let me know whether these comparison runs address your question.

Great, thanks so much for that. The real issue here was my (stupid) assumption about what a “token” is. For some reason I got it into my head that tokens meant characters, which, of course, they do not. They correspond more closely to words.

Yes, the console output might have been confusing me also.

But this is why I posted to the Beginner channel, obviously.

For anyone reading this thread in the future… use --val_max_target_length and specify the number of tokens you want (think of them as words, not characters!)
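
If you want to sanity-check this for your own data, you can count tokens directly with the tokenizer (something I should have done first):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

for text in ["Bloggerlove Hooded Trench Coat", "Sony Bluetooth"]:
    ids = tokenizer(text).input_ids
    # common English words are usually a single token each; unusual brand names
    # may split into a few sub-word pieces, but tokens are nowhere near characters
    print(f"{text!r}: {len(ids)} tokens (includes the end-of-sequence token)")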

thanks

And yes, of course, my own dataset works well, now that I have chosen the right number of tokens.


@the-pale-king Consider sharing the resulting model if you can!