I’ve tried to train a Reformer model on the toy task described in the original paper: sequences of 1024 tokens in which the second half duplicates the first.
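For reference, generating data for this task is simple; here is a minimal sketch (my own, not the paper's exact setup — details like the vocabulary size and whether a separator token is used are assumptions):

```python
import numpy as np

def make_example(seq_len=1024, vocab_size=128, rng=np.random):
    """Toy duplication task: the second half of the sequence repeats the first.

    vocab_size and the absence of a separator token are guesses,
    not the paper's exact configuration.
    """
    half = seq_len // 2
    first = rng.randint(1, vocab_size, size=half)  # token ids 1..vocab_size-1
    return np.concatenate([first, first])          # model must copy the first half

x = make_example()
```

Trained autoregressively, the model can only do better than chance on the second half, which is why accuracy is typically measured there.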
I replicated the configuration given by the authors (one LSH attention layer, 256 hidden units, 4 attention heads), but it still took some learning-rate tuning before I reached about 99.4% accuracy on the dev set with 4 hashing rounds, still slightly below their reported 99.9%.
What worries me more is the very unstable loss curve: it falls, then bounces back, and actually tends to increase. Here is the training loss plot:
For comparison, I ran a deterministic setting: a single local (non-LSH) attention layer with the context length equal to the input size (1024). That was much more stable; accuracy went to 1.0 and stayed there.
So, even accounting for the random component (the hashing), I find it hard to understand why the Reformer loss should be so unstable. Some experiments on enwik8 yielded even worse results.
Do you think there might be something wrong in the LSH implementation, or is this just a very sensitive model?
I wouldn’t be surprised if this were yet another hypersensitive model, similar to ALBERT, which never gave me good results on any of my tasks. I’ve discussed this with many other researchers, all facing the same issue. However, I can’t be sure whether that’s also the case here…
For larger-scale datasets, yes… but since this is a simple toy problem, I expected a higher learning rate to work, and it did in fact work for a full-attention model. I tried again with 1e-4 and it behaved a lot better, though still with a small bounce in the training loss.
I think the model is just extremely sensitive to hyperparameters. A vanilla Transformer decoder worked well across a wider range of HPs, but the Reformer needed a lot of tuning.