Hi, thanks for this impressive library - I expect Huggingface to shortly take over the world. This is my first post.
I am using the most recent version of the library, cloned from master, as of 12-16-2020, specifically the code from here: https://github.com/huggingface/transformers/tree/master/examples/seq2seq.
It looks like @stas, and @sgugger have most recently touched this code, and might be best positioned to tell me what stupid mistake I am making.
I am trying to do some summarization with finetune_trainer.py.
As a proof of concept, I first started with the xsum dataset, running this shell script:
RUN="xsum-1500-train"
python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
--learning_rate=3e-5 \
--fp16 \
--do_train --do_eval --do_predict \
--evaluation_strategy steps \
--predict_with_generate \
--n_train 1500 \
--n_val 300 \
--n_test 100 \
--num_train_epochs 1 \
--data_dir "/workspace/rabbit-py/corpii_foreign/xsum" \
--model_name_or_path "t5-small" \
--output_dir "/workspace/rabbit-py/predictions/$RUN" \
--per_device_train_batch_size 5 \
--per_device_eval_batch_size 8 \
--task 'summarization' \
--overwrite_output_dir \
--run_name $RUN
"$@"
This works well, and in about two minutes (using 2x RTX 2070 Super) it generates text in the test_generations.txt output file.
Here is the first line of that file:
the trio are up for best UK act and best album, as well as two nominations in the best song category. they have been nominated for their favourite album, Number One and Strong Again.
This is indeed a summary of the originating text, found in the first line of test.source:
The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning 'Oh I think you're nominated'", said Dappy."And I was like 'Oh yeah, which one?' And now we've got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it's best of us to come down and mingle with everyone and say hello to the cameras. And now we find we've got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn't be too disappointed if they didn't win this time around."At the end of the day we're grateful to be where we are in our careers."If it don't happen then it don't happen - live to fight another day and keep on making albums and hits for the fans."Dappy also revealed they could be performing live several times on the night.The group will be doing Number One and also a possible rendition of the War Child single, I Got Soul.The charity song is a re-working of The Killers' All These Things That I've Done and is set to feature artists like Chipmunk, Ironik and Pixie Lott.This year's Mobos will be held outside of London for the first time, in Glasgow on 30 September.N-Dubz said they were looking forward to performing for their Scottish fans and boasted about their recent shows north of the border."We just done Edinburgh the other day," said Dappy."We smashed up an N-Dubz show over there. We done Aberdeen about three or four months ago - we smashed up that show over there! Everywhere we go we smash it up!"
So far, so good.
Note that I am running only 1 epoch, and only using a small fraction of the data, because at this point I only want to do a proof of concept.
I now want to try a proof of concept on my own data.
My own training data uses shorter text, in both the input and the output.
For my training data I am trying to summarize text like:
HSH Solid Wood Bookshelf, 2 Tier Rustic Vintage Industrial Etagere Bookcase, Open Metal Farmhouse Book Shelf, Distressed Brown
…and end up with a summary like:
HSH Rustic Industrial
This example, as you can see, happens to fit the description of an “extractive” summarization, where all the text in the training target is included in the training source. Not all of my rows are like that; many might require something closer to an “abstractive” summarization. (Just FYI.)
So, as a proof of concept, I now try a minimal modification of my script, just putting in my data directory, instead of xsum:
RUN="sn-vs-n-1-simple"
python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
--learning_rate=3e-5 \
--fp16 \
--do_train --do_eval --do_predict \
--evaluation_strategy steps \
--predict_with_generate \
--n_train 1500 \
--n_val 300 \
--n_test 100 \
--num_train_epochs 1 \
--data_dir "/workspace/rabbit-py/corpii/short_name_vs_name" \
--model_name_or_path "t5-small" \
--output_dir "/workspace/rabbit-py/predictions/$RUN" \
--per_device_train_batch_size 5 \
--per_device_eval_batch_size 8 \
--task 'summarization' \
--overwrite_output_dir \
--run_name $RUN
"$@"
I run this, and… given this first line of test.source…
Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker Striped Climbing Outdoor Hooded Trench Coats S-Xxl
… the first line of test_generations.txt is:
Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker Striped Climbing Outdoor Hooded Trench Coats S-Xxl
… whereas the first line of test.target is:
Bloggerlove Hooded Trench Coat
… and the second line of test.source is:
Sony Portable Bluetooth Digital Turner AM/FM CD Player Mega Bass Reflex Stereo Sound System
… and the second line of test_generations.txt is:
Sony Portable Bluetooth Digital Turner AM/FM CD Player Mega Bass Reflex Stereo Sound System. Sony portable Bluetooth digital Turner MP/FM MP3 player Mega bass Reflex stereo sound system.
… whereas the second line of test.target is:
Sony Bluetooth
So clearly this is not working right!
At a most basic level, the summaries are too long… and actually, it seems that T5 is hallucinating “additional” text to add to my input text!
So, my first stop is to look at the console output:
12/18/2020 19:28:54 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 2, distributed training: False, 16-bits training: True
12/18/2020 19:28:54 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/workspace/rabbit-py/predictions/sn-vs-n-1-simple', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, model_parallel=False, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=5, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Dec18_19-28-54_94c29ef5e746', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='sn-vs-n-1-simple', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, label_smoothing=0.0, sortish_sampler=False, predict_with_generate=True, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:422] 2020-12-18 19:28:54,234 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /workspace/rabbit-py/models_foreign/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
[INFO|configuration_utils.py:458] 2020-12-18 19:28:54,236 >> Model config T5Config {
"architectures": [
"T5WithLMHeadModel"
],
"d_ff": 2048,
"d_kv": 64,
"d_model": 512,
"decoder_start_token_id": 0,
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "relu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"n_positions": 512,
"num_decoder_layers": 6,
"num_heads": 8,
"num_layers": 6,
"output_past": true,
"pad_token_id": 0,
"relative_attention_num_buckets": 32,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"use_cache": true,
"vocab_size": 32128
}
[INFO|configuration_utils.py:422] 2020-12-18 19:28:54,466 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /workspace/rabbit-py/models_foreign/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
[INFO|configuration_utils.py:458] 2020-12-18 19:28:54,467 >> Model config T5Config {
"architectures": [
"T5WithLMHeadModel"
],
"d_ff": 2048,
"d_kv": 64,
"d_model": 512,
"decoder_start_token_id": 0,
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "relu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"n_positions": 512,
"num_decoder_layers": 6,
"num_heads": 8,
"num_layers": 6,
"output_past": true,
"pad_token_id": 0,
"relative_attention_num_buckets": 32,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"use_cache": true,
"vocab_size": 32128
}
[INFO|tokenization_utils_base.py:1793] 2020-12-18 19:28:54,944 >> loading file https://huggingface.co/t5-small/resolve/main/spiece.model from cache at /workspace/rabbit-py/models_foreign/65fc04e21f45f61430aea0c4fedffac16a4d20d78b8e6601d8d996ebefefecd2.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d
[INFO|tokenization_utils_base.py:1793] 2020-12-18 19:28:54,944 >> loading file https://huggingface.co/t5-small/resolve/main/tokenizer.json from cache at /workspace/rabbit-py/models_foreign/06779097c78e12f47ef67ecb728810c2ae757ee0a9efe9390c6419783d99382d.8627f1bd5d270a9fd2e5a51c8bec3223896587cc3cfe13edeabb0992ab43c529
[INFO|modeling_utils.py:1014] 2020-12-18 19:28:55,263 >> loading weights file https://huggingface.co/t5-small/resolve/main/pytorch_model.bin from cache at /workspace/rabbit-py/models_foreign/fee5a3a0ae379232608b6eed45d2d7a0d2966b9683728838412caccc41b4b0ed.ddacdc89ec88482db20c676f0861a336f3d0409f94748c209847b49529d73885
[WARNING|modeling_utils.py:1122] 2020-12-18 19:28:56,647 >> Some weights of the model checkpoint at t5-small were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[INFO|modeling_utils.py:1139] 2020-12-18 19:28:56,647 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
12/18/2020 19:28:56 - INFO - utils - using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
12/18/2020 19:28:58 - INFO - __main__ - *** Train ***
[INFO|trainer.py:668] 2020-12-18 19:28:58,881 >> ***** Running training *****
[INFO|trainer.py:669] 2020-12-18 19:28:58,881 >> Num examples = 1500
[INFO|trainer.py:670] 2020-12-18 19:28:58,881 >> Num Epochs = 1
[INFO|trainer.py:671] 2020-12-18 19:28:58,881 >> Instantaneous batch size per device = 5
[INFO|trainer.py:672] 2020-12-18 19:28:58,881 >> Total train batch size (w. parallel, distributed & accumulation) = 10
[INFO|trainer.py:673] 2020-12-18 19:28:58,881 >> Gradient Accumulation steps = 1
[INFO|trainer.py:674] 2020-12-18 19:28:58,881 >> Total optimization steps = 150
sn-vs-n-1-simple
[INFO|integrations.py:360] 2020-12-18 19:28:58,898 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: ghengis (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.12
wandb: Syncing run sn-vs-n-1-simple
wandb: ⭐ View project at https://wandb.ai/---/huggingface
wandb: 🚀 View run at https://wandb.ai/----/huggingface/runs/tu1t9h5g
wandb: Run data is saved locally in /workspace/rabbit-py/src/learning/wandb/run-20201218_192859-tu1t9h5g
wandb: Run `wandb offline` to turn off syncing.
0%| | 0/150 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
100%|██████████| 150/150 [00:42<00:00, 3.53it/s][INFO|trainer.py:821] 2020-12-18 19:29:43,464 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
{'epoch': 1.0}
100%|██████████| 150/150 [00:42<00:00, 3.49it/s]
[INFO|trainer.py:1183] 2020-12-18 19:29:43,467 >> Saving model checkpoint to /workspace/rabbit-py/predictions/sn-vs-n-1-simple
[INFO|configuration_utils.py:289] 2020-12-18 19:29:43,471 >> Configuration saved in /workspace/rabbit-py/predictions/sn-vs-n-1-simple/config.json
[INFO|modeling_utils.py:814] 2020-12-18 19:29:43,893 >> Model weights saved in /workspace/rabbit-py/predictions/sn-vs-n-1-simple/pytorch_model.bin
12/18/2020 19:29:43 - INFO - __main__ - ***** train metrics *****
12/18/2020 19:29:43 - INFO - __main__ - train_samples_per_second = 33.64
12/18/2020 19:29:43 - INFO - __main__ - train_runtime = 44.5896
12/18/2020 19:29:43 - INFO - __main__ - train_n_ojbs = 1500
12/18/2020 19:29:43 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:1369] 2020-12-18 19:29:43,950 >> ***** Running Evaluation *****
[INFO|trainer.py:1370] 2020-12-18 19:29:43,950 >> Num examples = 300
[INFO|trainer.py:1371] 2020-12-18 19:29:43,951 >> Batch size = 16
100%|██████████| 19/19 [00:20<00:00, 1.06s/it]
12/18/2020 19:30:05 - INFO - __main__ - ***** val metrics *****
12/18/2020 19:30:05 - INFO - __main__ - val_loss = 2.1546
12/18/2020 19:30:05 - INFO - __main__ - val_rouge1 = 26.3976
12/18/2020 19:30:05 - INFO - __main__ - val_rouge2 = 13.6039
12/18/2020 19:30:05 - INFO - __main__ - val_rougeL = 26.1308
12/18/2020 19:30:05 - INFO - __main__ - val_rougeLsum = 26.18
12/18/2020 19:30:05 - INFO - __main__ - val_gen_len = 37.9
12/18/2020 19:30:05 - INFO - __main__ - epoch = 1.0
12/18/2020 19:30:05 - INFO - __main__ - val_samples_per_second = 14.091
12/18/2020 19:30:05 - INFO - __main__ - val_runtime = 21.29
12/18/2020 19:30:05 - INFO - __main__ - val_n_ojbs = 300
12/18/2020 19:30:05 - INFO - __main__ - *** Predict ***
[INFO|trainer.py:1369] 2020-12-18 19:30:05,241 >> ***** Running Prediction *****
[INFO|trainer.py:1370] 2020-12-18 19:30:05,241 >> Num examples = 100
[INFO|trainer.py:1371] 2020-12-18 19:30:05,241 >> Batch size = 16
100%|██████████| 7/7 [00:05<00:00, 1.23it/s]12/18/2020 19:30:12 - INFO - __main__ - ***** test metrics *****
12/18/2020 19:30:12 - INFO - __main__ - test_loss = 2.2199
12/18/2020 19:30:12 - INFO - __main__ - test_rouge1 = 27.7161
12/18/2020 19:30:12 - INFO - __main__ - test_rouge2 = 13.4332
12/18/2020 19:30:12 - INFO - __main__ - test_rougeL = 27.8038
12/18/2020 19:30:12 - INFO - __main__ - test_rougeLsum = 27.7593
12/18/2020 19:30:12 - INFO - __main__ - test_gen_len = 37.2
12/18/2020 19:30:12 - INFO - __main__ - test_samples_per_second = 13.715
12/18/2020 19:30:12 - INFO - __main__ - test_runtime = 7.2913
12/18/2020 19:30:12 - INFO - __main__ - test_n_ojbs = 100
100%|██████████| 7/7 [00:06<00:00, 1.14it/s]
… And something that jumps out at me is this:
12/18/2020 19:28:56 - INFO - utils - using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
It seems I am using task specific params, which are asking the model for a max_length of 200 tokens, right?
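To double-check my reading of that line, I tried a quick sanity check outside the script. This is only a rough sketch: I am assuming that generate() falls back to config.max_length and config.min_length when they are not passed explicitly, and that the script copies the task-specific params onto the config in roughly this way:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# my guess at what the script does for the summarization task:
# copy the task-specific params onto the top level of the config
model.config.update(model.config.task_specific_params["summarization"])

text = "summarize: Bloggerlove Rain Jacket Women Lightweight Raincoat Waterproof Windbreaker Striped Climbing Outdoor Hooded Trench Coats S-Xxl"
batch = tokenizer(text, return_tensors="pt")

# no explicit lengths here, so generate() should fall back to
# config.max_length (200) and config.min_length (30)
long_ids = model.generate(**batch)

# passing the lengths explicitly should override the config values
short_ids = model.generate(**batch, max_length=50, min_length=1)

print(tokenizer.decode(long_ids[0], skip_special_tokens=True))
print(tokenizer.decode(short_ids[0], skip_special_tokens=True))

If I have that right, the min_length of 30 by itself would explain the padded-out, partly invented summaries on my short titles, since the model is forced to emit at least 30 tokens.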
And then I see this Github comment:
As @aychang95 suggested you have to play around with the generate method arguments to see what works best for your example. Especially take a look at num_beams, max_length, min_length, early_stopping and length_penalty.
So my idea is: I should shorten this max_length. My target summaries never go over 50 tokens, so I should tell this to T5!
I reference the --help in finetune_trainer.py:
--max_target_length MAX_TARGET_LENGTH
The maximum total sequence length for target text
after tokenization. Sequences longer than this will be
truncated, sequences shorter will be padded.
--val_max_target_length VAL_MAX_TARGET_LENGTH
The maximum total sequence length for validation
target text after tokenization. Sequences longer than
this will be truncated, sequences shorter will be
padded.
--test_max_target_length TEST_MAX_TARGET_LENGTH
The maximum total sequence length for test target text
after tokenization. Sequences longer than this will be
truncated, sequences shorter will be padded.
So it seems these arguments might do something. I can't figure out why these three values would ever need to differ; shouldn't they all match the maximum prediction length that I want? For the time being, I assume they should, and set them all to the same value.
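For what it's worth, my (possibly wrong) mental model of these flags is that they only control target-side truncation and padding during preprocessing, roughly along these lines, and not how long generate() is allowed to make the prediction:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
target = "Bloggerlove Hooded Trench Coat"

# what I think --max_target_length 50 amounts to: the label sequence used
# during training is truncated/padded to 50 tokens...
labels = tokenizer(target, max_length=50, truncation=True,
                   padding="max_length", return_tensors="pt")
print(labels["input_ids"].shape)  # torch.Size([1, 50])

# ...while the length of the generated prediction would still be governed by
# whatever max_length/min_length generate() ends up using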
So next, I run this script:
RUN="sn-vs-n-1-with-target-length"
python3 /workspace/rabbit-py/transformers/examples/seq2seq/finetune_trainer.py \
--learning_rate=3e-5 \
--fp16 \
--do_train --do_eval --do_predict \
--evaluation_strategy steps \
--predict_with_generate \
--n_train 1500 \
--n_val 300 \
--n_test 100 \
--num_train_epochs 1 \
--data_dir "/workspace/rabbit-py/corpii/short_name_vs_name" \
--model_name_or_path "t5-small" \
--output_dir "/workspace/rabbit-py/predictions/$RUN" \
--per_device_train_batch_size 5 \
--per_device_eval_batch_size 8 \
--max_target_length 50 \
--val_max_target_length 50 \
--test_max_target_length 50 \
--overwrite_output_dir \
--run_name $RUN
"$@"
The only changes here are these added arguments:
--max_target_length 50 \
--val_max_target_length 50 \
--test_max_target_length 50 \
… this script finishes… and then I find that the newly generated test_generations.txt is exactly the same!
So, as far as I can tell, these three added arguments have had no effect…!
and… the console output contains the same thing:
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
...
12/18/2020 19:52:35 - INFO - utils - using task specific params for summarization: {'early_stopping': True, 'length_penalty': 2.0, 'max_length': 200, 'min_length': 30, 'no_repeat_ngram_size': 3, 'num_beams': 4, 'prefix': 'summarize: '}
12/18/2020 19:52:38 - INFO - __main__ - *** Train ***
So, my rough guess here is that somehow these max_target_length arguments are being overridden. I re-run the script, this time removing this line:
--task 'summarization' \
… But once again, I get the same “too long” summaries, and in the console I see the same using task specific params… line.
So my guess at this point is that there is either a bug or something lacking in the documentation: some extra step that needs to be done to override the task specific params when using finetune_trainer.py?
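In the meantime, the only workaround I can think of is to save a local copy of t5-small with the summarization params edited down before fine-tuning, something like the rough sketch below (the save path is hypothetical, and I am assuming the script would then pick up the edited params instead of the stock ones):

from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# shrink the generation limits carried by the summarization task params
model.config.task_specific_params["summarization"].update(
    {"max_length": 50, "min_length": 1}
)

save_dir = "/workspace/rabbit-py/models/t5-small-short"  # hypothetical path
model.save_pretrained(save_dir)      # writes pytorch_model.bin and config.json
tokenizer.save_pretrained(save_dir)

# ...and then point --model_name_or_path at save_dir in the shell script

But that feels like a hack, so I'd rather learn the intended way.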
Or (quite possibly) I’m doing something else wrong??
thanks!