Finetuning sequence pairs (GLUE) with longer sequence lengths seems to fail?

:question: Questions & Help

Details

I have an issue where I use the standard GLUE finetuning script for the STS-B task with longer sequence lengths, and the results are bad (see below). Correlation decreases massively with longer sequence lengths, and the same happens when I replace the regression head with binary classification over two classes. A sequence length of 128 works well, and with some data (e.g. Yelp) 256 does too, but longer sequence lengths simply fail.
My assumption was that longer sequence lengths should give similar or sometimes better results, and that shorter input sequences are padded but the padding does not influence the embeddings because of the attention mask (which marks where the real input is and where it is not). Is that assumption wrong?
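
For context, my understanding of the masking behaviour comes from a quick check along the lines of the sketch below (illustrative only, with made-up example sentences; not part of my training code):

# Quick check that padded positions get attention_mask = 0 and should therefore
# be ignored by self-attention (transformers 3.x, same tokenizer as in the runs
# below; the example sentences are made up).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

enc = tokenizer(
    "A short first sentence.",      # sequence A
    "An even shorter second one.",  # sequence B of the pair (as in STS-B)
    padding="max_length",
    max_length=512,
    truncation=True,
)

print(sum(enc["attention_mask"]))  # number of real (non-padding) tokens
print(len(enc["input_ids"]))       # 512, the rest is [PAD]
print(enc["input_ids"][-5:])       # trailing [PAD] token ids (0 for BERT)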

Initially, I was using the Yelp business review dataset for sentiment prediction, which worked well for sequence lengths of 128, 256 and 512. Pairing reviews of the same business by sentiment should be close to sequence-pair classification (so I know the task/data works), but that only gave good results for sequence lengths of 128 and 256; with 400 or 512 the model just predicted zeros (as far as I observed). I then tried the same thing with the GLUE STS-B data, and the same issue happened.

Background:
Before that, I was using GluonNLP (MXNet) and its BERT demo finetuning script (also GLUE STS-B style) with the same data and essentially the same workflow (even the same hyperparameters) as here in PyTorch. There, all sequence lengths worked, and longer sequence lengths even improved results (despite the smaller batch sizes forced by GPU RAM and the longer training durations). Since the input texts were both shorter and longer than the limit (about a third were longer, I guess), that was not surprising. I'm currently trying to switch to transformers because of the larger choice of models and better support…

So, what am I doing wrong?
I tried using a constant learning rate schedule (with the default learning rate from the code; see the sketch below), but it gave no improvement.
I also tried different datasets, with almost the same end results (even when the input texts were longer than the maximum sequence length).
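
To be clear about what I mean by a constant schedule: something along the lines of the simplified sketch below, where a constant scheduler replaces the Trainer's default linear decay (the helper name is just illustrative, not something from run_glue.py):

# Simplified sketch of a constant LR schedule; `build_constant_lr_optimizer`
# is an illustrative helper name and does not exist in run_glue.py.
from transformers import AdamW, get_constant_schedule

def build_constant_lr_optimizer(model, lr=2e-5):
    optimizer = AdamW(model.parameters(), lr=lr)  # same default LR as in my runs
    scheduler = get_constant_schedule(optimizer)  # keeps the LR fixed over training
    return optimizer, scheduler

# The pair can then be passed to the Trainer via its optimizers=(optimizer, scheduler) argument.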

Can others reproduce this? (Just switch to seqlen 512 and batchsize 8 / seqlen 256 and batchsize 16)
Do I have to choose another padding strategy?
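
By "another padding strategy" I mean, for example, padding each batch only to its longest member instead of always to max_seq_length, roughly like this hypothetical sketch (example texts are made up):

# Hypothetical alternative: dynamic padding to the longest pair in each batch
# instead of always padding to max_seq_length (example texts are made up).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

first_texts = ["First review of a business ...", "A much longer second review ..."]
second_texts = ["Its paired review ...", "The partner of the second review ..."]

batch = tokenizer(
    first_texts,
    second_texts,
    padding="longest",  # pad only up to the longest pair in this batch
    truncation=True,
    max_length=512,     # still cap extremely long pairs
    return_tensors="pt",
)

print(batch["input_ids"].shape)  # (2, length of the longest pair), not (2, 512)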


Results on GeForce RTX 2080 with transformers version 3.3.1 and CUDA 10.2:

# my script args (basically just changing the data/output dirs, the sequence length, and the batch size for GPU memory reasons)
# transformers_copy is the cloned repo root folder
export GLUE_DIR=data/glue
export TASK_NAME=STS-B
python transformers_copy/examples/text-classification/run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --data_dir data/sentiment/yelp-pair-b/   --max_seq_length 128 --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output/glue_yelp_128_32
CUDA_VISIBLE_DEVICES=1 python transformers_copy/examples/text-classification/run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --data_dir data/glue/STS-B/   --max_seq_length 256 --per_device_train_batch_size 16 --per_device_eval_batch_size 16  --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output/glue_STS-B_256_16 --save_steps 1000
CUDA_VISIBLE_DEVICES=1 python transformers_copy/examples/text-classification/run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --data_dir data/glue/STS-B/   --max_seq_length 512 --per_device_train_batch_size 8 --per_device_eval_batch_size 8  --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output/glue_STS-B_512_8 --save_steps 2000
# cat glue_STS-B_128_32/eval_results_sts-b.txt
# seqlen 128
eval_loss = 0.5857866220474243
eval_pearson = 0.8675888610991327
eval_spearmanr = 0.8641174656753431
eval_corr = 0.865853163387238
epoch = 3.0
total_flos = 1434655122529536
# cat glue_STS-B_256_16/eval_results_sts-b.txt
# seqlen 256
# this result should not be this bad, as far as I can tell
eval_loss = 2.2562920122146606
eval_pearson = 0.22274851498729242
eval_spearmanr = 0.09065396938535858
eval_corr = 0.1567012421863255
epoch = 3.0
total_flos = 2869310245059072
# cat glue_STS-B_512_8/eval_results_sts-b.txt
# seqlen 512
eval_loss = 2.224635926246643
eval_pearson = 0.24041184048438544
eval_spearmanr = 0.08133980923357159
eval_corr = 0.1608758248589785
epoch = 3.0
total_flos = 5738620490118144

Yelp (sentiment, single sequence) with sequence length of 512

# cat yelp-sentiment-b_512_16_1/eval_results_sent-b.txt
eval_loss = 0.2301591751359403
eval_acc = 0.92832
eval_f1 = 0.945765994794504
eval_acc_and_f1 = 0.937042997397252
eval_pearson = 0.8404006160382227
eval_spearmanr = 0.8404006160382247
eval_corr = 0.8404006160382237
eval_class_report = {'not same': {'precision': 0.9099418011639767, 'recall': 0.8792393761957215, 'f1-score': 0.8943271612218422, 'support': 17249}, 'same': {'precision': 0.937509375093751, 'recall': 0.954169338340814, 'f1-score': 0.945765994794504, 'support': 32751}, 'accuracy': 0.92832, 'macro avg': {'precision': 0.9237255881288639, 'recall': 0.9167043572682677, 'f1-score': 0.920046578008173, 'support': 50000}, 'weighted avg': {'precision': 0.9279991134394574, 'recall': 0.92832, 'f1-score': 0.928020625988607, 'support': 50000}}
epoch = 0.08
total_flos = 26906733281280000

Yelp (sequence pairs) with 128, 256 and 512 (where 512 fails)

# cat yelp-pair-b_128_32_3/eval_results_same-b.txt
# seqlen 128
eval_loss = 0.4788903475597093
eval_acc = 0.8130612708878027
eval_f1 = 0.8137388152678672
eval_acc_and_f1 = 0.813400043077835
eval_pearson = 0.6262220422479998
eval_spearmanr = 0.6262220422479998
eval_corr = 0.6262220422479998
eval_class_report = {'not same': {'precision': 0.8189660129967221, 'recall': 0.8058966668552996, 'f1-score': 0.8123787792355962, 'support': 35342}, 'same': {'precision': 0.8072925445249733, 'recall': 0.8202888622481018, 'f1-score': 0.8137388152678672, 'support': 35034}, 'accuracy': 0.8130612708878027, 'macro avg': {'precision': 0.8131292787608477, 'recall': 0.8130927645517008, 'f1-score': 0.8130587972517317, 'support': 70376}, 'weighted avg': {'precision': 0.8131548231814548, 'recall': 0.8130612708878027, 'f1-score': 0.8130558211583339, 'support': 70376}}
epoch = 3.0
total_flos = 71009559802626048
# cat yelp-pair-b_256_16_1/eval_results_same-b.txt
# seqlen 256
eval_loss = 0.3369856428101318
eval_acc = 0.8494088893941116
eval_f1 = 0.8505977218901545
eval_acc_and_f1 = 0.850003305642133
eval_pearson = 0.6990572001217541
eval_spearmanr = 0.6990572001217481
eval_corr = 0.6990572001217511
eval_class_report = {'not same': {'precision': 0.8588791553054476, 'recall': 0.8377850715862147, 'f1-score': 0.8482009854474619, 'support': 35342}, 'same': {'precision': 0.840315302768648, 'recall': 0.8611348975281156, 'f1-score': 0.8505977218901545, 'support': 35034}, 'accuracy': 0.8494088893941116, 'macro avg': {'precision': 0.8495972290370477, 'recall': 0.8494599845571651, 'f1-score': 0.8493993536688083, 'support': 70376}, 'weighted avg': {'precision': 0.8496378513129752, 'recall': 0.8494088893941116, 'f1-score': 0.8493941090198912, 'support': 70376}}
epoch = 1.0
total_flos = 47339706535084032
# cat yelp-pair-b_512_8_3/eval_results_same-b.txt
# seqlen 512
# here it basically just predicts zeros all the time (as far as I saw)
eval_loss = 0.6931421184073636
eval_acc = 0.5021882459929522
eval_f1 = 0.0
eval_acc_and_f1 = 0.2510941229964761
eval_pearson = nan
eval_spearmanr = nan
eval_corr = nan
eval_class_report = {'not same': {'precision': 0.5021882459929522, 'recall': 1.0, 'f1-score': 0.6686089407669461, 'support': 35342}, 'same': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 35034}, 'accuracy': 0.5021882459929522, 'macro avg': {'precision': 0.2510941229964761, 'recall': 0.5, 'f1-score': 0.33430447038347305, 'support': 70376}, 'weighted avg': {'precision': 0.25219303441347785, 'recall': 0.5021882459929522, 'f1-score': 0.3357675512189583, 'support': 70376}}
epoch = 3.0
total_flos = 284038239210504192

Side note:
I also ran Yelp with regression; it worked for 128, but for 512 the correlation was below 0.3, so it failed there again.
I also tried another (private) dataset, with similar results…

A short reply for now. More details will follow.

It seems to depend somewhat on the batch size. Using gradient accumulation to compensate for the smaller batch sizes required by longer sequence lengths fixes my issue where the model previously only predicted 1 or 0. So it may simply be that the model cannot generalize well enough if the batch size is too small.
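
Concretely, this corresponds to the --gradient_accumulation_steps argument of run_glue.py / TrainingArguments; the snippet below is a simplified sketch with a made-up output dir, not my exact invocation:

# Simplified sketch of the gradient accumulation setup (made-up output dir).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/glue_STS-B_512_8_accum4",  # hypothetical output dir
    per_device_train_batch_size=8,  # what fits into GPU memory at seqlen 512
    gradient_accumulation_steps=4,  # 8 * 4 = effective batch size of 32,
                                    # matching the seqlen-128 runs above
    learning_rate=2e-5,
    num_train_epochs=3.0,
)

print(training_args.per_device_train_batch_size
      * training_args.gradient_accumulation_steps)  # 32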

But this still seems connected to the learning rate and optimizer (I would assume): with an older implementation of BERT finetuning in MXNet I could train with batch sizes of 2 and still reach the same level as with shorter sequence lengths, or even better. The current finetuning code from GluonNLP (MXNet) seems to have a similar issue with longer sequence lengths/smaller batch sizes, and gradient accumulation helped there, too. So finding out what changed compared to the older code might help track down the root cause.