Reformer Model / Fixed default num_buckets


Following the PR, if the num_buckets hyperparameter is not provided, num_buckets is computed to a sensible value using the formula given in the paper (see page 4).

However, from my understanding, the HuggingFace implementation uses the sequence length of the first batch instead of the maximum sequence length. This can result in a lower num_buckets than the paper's formula would give.
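To make the concern concrete, here is a minimal sketch of the kind of computation I mean. The function below is my own illustration of the paper's recommendation (roughly num_buckets ≈ 2 × sequence length / chunk length, rounded to a power of 2), not the actual HuggingFace code; the chunk length of 64 and the two sequence lengths are made-up values for the example:

```python
def suggested_num_buckets(seq_len: int, chunk_length: int = 64) -> int:
    """Sketch of the paper's recommendation: num_buckets ~ 2 * seq_len / chunk_length,
    rounded down to a power of 2. Illustrative only, not the HF implementation."""
    raw = 2 * seq_len // chunk_length
    return 2 ** max(raw.bit_length() - 1, 1)

# If the first batch is much shorter than the model's maximum sequence length,
# the derived bucket count differs noticeably:
first_batch_len = 2048   # hypothetical length of the first batch seen
max_seq_len = 65536      # hypothetical maximum sequence length of the model

print(suggested_num_buckets(first_batch_len))  # -> 64
print(suggested_num_buckets(max_seq_len))      # -> 2048
```

Under these assumptions, deriving num_buckets from the first batch (64) rather than from the maximum length (2048) leaves far fewer hash buckets than the formula intends for long sequences.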

Is there a reason for this slightly different implementation from the paper?

Please let me know if I'm missing something.