Unstable Reformer training on toy task

erickrf · August 4, 2020, 9:30pm

I’ve tried to train a Reformer model with a toy task as described in the original paper: a sequence of 1024 tokens, such that the first half is the same as the second.
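For readers who want to reproduce the setup, here is a minimal sketch of how such data could be generated. The vocabulary size of 64, the choice to keep token 0 unused, and the function names are assumptions for illustration; they are not taken from the post or the Colab.

```python
import random

def make_copy_example(seq_len=1024, vocab_size=64, rng=None):
    """Return a token list whose second half repeats its first half."""
    if seq_len % 2 != 0:
        raise ValueError("seq_len must be even")
    rng = rng or random.Random()
    # Token 0 is left unused here (an assumption, e.g. for padding).
    half = [rng.randrange(1, vocab_size) for _ in range(seq_len // 2)]
    return half + half

def make_dataset(n_examples, seq_len=1024, vocab_size=64, seed=0):
    """Build a reproducible list of duplicated-half sequences."""
    rng = random.Random(seed)
    return [make_copy_example(seq_len, vocab_size, rng) for _ in range(n_examples)]

dataset = make_dataset(4)
example = dataset[0]
print(len(example), example[:512] == example[512:])
```

With data like this, only the second half of each sequence is predictable, so accuracy is most meaningful when measured on those positions.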

I tried replicating the configurations given by the authors (one LSH layer, 256 hidden units, 4 attention heads), and it still took me some tuning with the learning rate until I got around 99.4% accuracy on the dev set with 4 hashing rounds, still slightly below their 99.9%.

But worse is the very unstable loss curve. It falls, then bounces back and actually tends to increase. Here is the training loss plot:

I compared with a deterministic setting, with a single local (no LSH) attention layer and the context length set to the size of the inputs (1024). It was much more stable: accuracy went to 1.0 and stayed there.

So, even with a random component, I find it hard to understand why the Reformer loss should be so unstable. Some experiments with enwik8 yielded even worse results.

Do you think there might be something wrong in the LSH implementation, or is this just a super sensitive model?

marrrcin · August 5, 2020, 11:12am

Can you share a colab link for this? It would be easier to debug.

Some additional questions:

Do you use FP16?
How did you tune the learning rate?
Do you use gradient clipping?
What is the batch size?

BramVanroy · August 5, 2020, 4:24pm

I wouldn’t be surprised if this was yet another sensitive model, similar to ALBERT, which never gave me good results on any of my tasks. I discussed this with many other researchers, all facing the same issue. However, I cannot be sure whether that is also the case here…

erickrf · August 5, 2020, 7:21pm

Sure. Here it is: Google Colab

No, I’m not using FP16.

I just started with 0.001 and tried reducing it. The full attention model was a lot less sensitive to it.

Yes, with a value of 1. I’m using the Trainer class from transformers.

I tried larger values too, and didn’t notice much difference.
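As a rough illustration of what gradient clipping with a value of 1 does, here is a sketch of clipping by global L2 norm in plain Python. It assumes the clipping meant above is norm-based (in the Trainer this is controlled by `max_grad_norm` in `TrainingArguments`); the function name and the list-of-lists gradient format are hypothetical and only for illustration.

```python
import math

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradient vectors so their global L2 norm is at most max_norm.

    grads is a list of per-parameter gradient vectors (lists of floats).
    Returns the clipped gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Gradients already within the limit are left unchanged (scale = 1).
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Clipping bounds the size of each update but not its direction, so it limits the damage of a loss spike without necessarily preventing one.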

marrrcin · August 6, 2020, 12:08pm

LMs usually use a much lower learning rate than 0.001 (e.g., BERT uses 5e-5). Try finding a good starting LR using the LR Find algorithm.
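The idea behind an LR finder can be sketched as follows: raise the learning rate exponentially over a single short run, record the loss at each step, stop once the loss blows up, and pick a value well below the point where it was lowest. This is a toy sketch on a one-dimensional quadratic, not the Reformer; the function names, the LR range, and the "divide by 10" rule of thumb are assumptions for illustration.

```python
import math

def lr_range_test(loss_fn, grad_fn, w0, lr_min=1e-4, lr_max=10.0,
                  num_steps=60, diverge_factor=4.0):
    """Run one pass of gradient descent with an exponentially growing LR."""
    w = list(w0)
    history = []  # (learning rate, loss measured before the update)
    best = float("inf")
    for k in range(num_steps):
        lr = lr_min * (lr_max / lr_min) ** (k / (num_steps - 1))
        loss = loss_fn(w)
        # Stop as soon as the loss is clearly diverging.
        if not math.isfinite(loss) or loss > diverge_factor * best:
            break
        history.append((lr, loss))
        best = min(best, loss)
        grad = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return history

def suggest_lr(history, safety=10.0):
    """Heuristic: take the LR at the lowest recorded loss, divided by a safety factor."""
    best_lr, _ = min(history, key=lambda pair: pair[1])
    return best_lr / safety

# Toy objective f(w) = 0.5 * 3 * w^2, standing in for a real training loss.
loss_fn = lambda w: 0.5 * 3.0 * w[0] ** 2
grad_fn = lambda w: [3.0 * w[0]]

history = lr_range_test(loss_fn, grad_fn, w0=[1.0])
suggested = suggest_lr(history)
print(len(history), suggested)
```

On a real model the loss curve is noisy, so it is usually smoothed before a value is picked.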

Also, the Colab is not public. Could you make it viewable?

erickrf · August 6, 2020, 12:23pm

For larger scale datasets, yes… but this being a simple toy problem, I expected something higher would work, and it in fact did for a full attention model. I tried again with 1e-4 and it worked a lot better, but still with a small bounce in the training loss.

The Colab is public now.

marrrcin · September 11, 2020, 11:54am

@erickrf have you found the root cause of the issue?

erickrf · September 11, 2020, 2:24pm

I think the model is just super sensitive to hyperparameters. I tried a vanilla Transformer decoder and it worked well with a wider range of HPs, but the Reformer needed a lot of tuning.

marrrcin · September 14, 2020, 8:59am

That’s good to know, thanks!
