I’ve tried to train a Reformer model on the toy task described in the original paper: sequences of 1024 tokens in which the first half is identical to the second half.
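For reference, this is roughly how I generate the training data (a minimal sketch; the vocabulary size of 128 and the reserved 0 token are my choices, not something specified in the paper):

```python
import numpy as np

def make_duplication_batch(batch_size, seq_len=1024, vocab_size=128, seed=None):
    """Toy duplication task: the second half of each sequence repeats the first.
    vocab_size and the reserved 0 id are assumptions for this sketch."""
    rng = np.random.default_rng(seed)
    half = seq_len // 2
    # draw random tokens from 1..vocab_size-1 for the first half
    first = rng.integers(1, vocab_size, size=(batch_size, half))
    # second half is an exact copy of the first
    return np.concatenate([first, first], axis=1)

batch = make_duplication_batch(8, seed=0)
```

The model only needs to learn to copy: every token in the second half is fully determined by the corresponding token 512 positions earlier, which is why accuracy close to 100% should be reachable.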
I replicated the configuration given by the authors (one LSH attention layer, 256 hidden units, 4 attention heads), but it still took some learning-rate tuning to reach about 99.4% accuracy on the dev set with 4 hashing rounds, slightly below their reported 99.9%.
Worse, though, is the very unstable loss curve: the loss falls, then bounces back, and actually tends to increase over time. Here is the training loss plot:
For comparison, I ran a deterministic setup with a single local (non-LSH) attention layer and a context length equal to the input size (1024). It was much more stable: accuracy went to 1.0 and stayed there.
So, even accounting for the random component introduced by hashing, I find it hard to understand why the Reformer loss should be this unstable. Some experiments on enwik8 yielded even worse results.
Do you think there might be something wrong in my LSH implementation, or is this just an extremely sensitive model?