Unstable Reformer training on toy task

I’ve tried to train a Reformer model on the toy task described in the original paper: sequences of 1024 tokens in which the first half is identical to the second half.
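Concretely, the data looks something like this (a minimal sketch; the function name and vocabulary size are my own choices, not from the paper):

```python
import torch

def make_copy_batch(batch_size, seq_len=1024, vocab_size=128):
    """Duplication toy task: the second half of each sequence repeats the first."""
    half = torch.randint(1, vocab_size, (batch_size, seq_len // 2))
    return torch.cat([half, half], dim=1)

batch = make_copy_batch(8)  # 8 sequences of length 1024
```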

I tried to replicate the configuration given by the authors (one LSH layer, 256 hidden units, 4 attention heads), and it still took some learning-rate tuning until I reached around 99.4% accuracy on the dev set with 4 hashing rounds, still slightly below their 99.9%.
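For reference, my setup corresponds roughly to this (a sketch; exact argument names depend on the transformers version, and the vocabulary size and axial position shape are assumptions rather than tuned values):

```python
from transformers import ReformerConfig, ReformerModelWithLMHead

config = ReformerConfig(
    attn_layers=["lsh"],           # a single LSH attention layer
    hidden_size=256,
    num_attention_heads=4,
    num_hashes=4,                  # 4 hashing rounds
    max_position_embeddings=1024,
    axial_pos_shape=(32, 32),      # must multiply to the sequence length
    vocab_size=128,                # assumption for the toy task
    is_decoder=True,
)
model = ReformerModelWithLMHead(config)
```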

But worse is the very unstable loss curve. It falls, then bounces back and actually tends to increase. Here is the training loss plot:
[training loss plot]

I compared with a deterministic setting: a single local (no LSH) attention layer with a context length equal to the input size (1024). It was much more stable; accuracy went to 1.0 and stayed there.
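That baseline is essentially the same config with a single local layer whose chunk covers the whole input (again a sketch, with the same caveats about argument names):

```python
from transformers import ReformerConfig

baseline_config = ReformerConfig(
    attn_layers=["local"],
    local_attn_chunk_length=1024,  # one chunk spans the full 1024-token input
    hidden_size=256,
    num_attention_heads=4,
    max_position_embeddings=1024,
    axial_pos_shape=(32, 32),
    vocab_size=128,
    is_decoder=True,
)
```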

So, even accounting for the random component of LSH, I find it hard to understand why the Reformer loss should be so unstable. Some experiments with enwik8 yielded even worse results.

Do you think there might be something wrong in the LSH implementation, or is this just a super sensitive model?

Can you share a Colab link for this? It would be easier to debug.

Some additional questions:

  • Do you use FP16?
  • How did you tune the learning rate?
  • Do you use gradient clipping?
  • What is the batch size?

I wouldn’t be surprised if this were yet another sensitive model, similar to ALBERT, which never gave me good results on any of my tasks. I have discussed this with many other researchers, all facing the same issue. However, I cannot be sure whether that is also the case here…

Sure. Here it is: Google Colab

No

I just started with 0.001 and tried reducing it. The full attention model was a lot less sensitive to it.

Yes, with a max norm of 1. I’m using the Trainer class from transformers.

It’s 1. I tried larger values too and didn’t notice much difference.
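Putting those answers together, the Trainer setup is roughly this (a sketch rather than my exact script; `model`, `train_dataset`, and the output path stand in for the actual objects):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="reformer-copy-task",   # placeholder path
    learning_rate=1e-3,                # the starting LR mentioned above
    per_device_train_batch_size=1,
    max_grad_norm=1.0,                 # gradient clipping at norm 1
    fp16=False,
    logging_steps=50,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```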

LMs usually use a much lower learning rate than 0.001 (e.g. BERT uses 5e-5); try finding a good starting LR with an LR range test (the “LR find” algorithm).
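A hand-rolled version of that range test looks roughly like this (`model` and `loader` are placeholders, and I assume an HF-style LM head that returns a loss when given labels):

```python
import torch

def lr_range_test(model, loader, lr_min=1e-6, lr_max=1.0, steps=200, device="cpu"):
    """Exponentially sweep the LR and record the loss; a good starting LR is
    roughly an order of magnitude below where the loss is lowest."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / steps)  # multiplicative LR step
    history = []
    data_iter = iter(loader)
    for _ in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            batch = next(data_iter)
        batch = batch.to(device)
        optimizer.zero_grad()
        loss = model(input_ids=batch, labels=batch).loss  # HF-style LM loss
        loss.backward()
        optimizer.step()
        lr = optimizer.param_groups[0]["lr"]
        history.append((lr, loss.item()))
        for group in optimizer.param_groups:
            group["lr"] = lr * gamma
    return history
```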

Also: the Colab is not public, could you make it viewable?

For larger-scale datasets, yes… but since this is a simple toy problem, I expected something higher to work, and in fact it did for the full attention model. I tried again with 1e-4 and it worked a lot better, though there is still a small bounce in the training loss.

The Colab is public now.

@erickrf have you root-caused the issue?

I think the model is just super sensitive to hyperparameters. I tried a vanilla Transformer decoder and it worked well with a wider range of HPs, but the Reformer needed a lot of tuning.

That’s good to know, thanks!