Reformer Model / Fixed default num_buckets


Following the PR, if the num_buckets hyperparameter is not provided, num_buckets is computed to a sensible value using the formula given in the paper (see page 4).

However, from my understanding, the HuggingFace implementation uses the sequence length of the first batch instead of the maximum sequence length. This can result in a lower num_buckets than the paper's formula would give.
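To make the concern concrete, here is a minimal sketch of the kind of computation I mean. The function below is my own illustration of the paper's recommendation (roughly num_buckets ≈ 2 × sequence length / chunk length, rounded to a power of 2), not the actual HuggingFace code; the chunk length of 64 and the two sequence lengths are made-up values for the example:

```python
def suggested_num_buckets(seq_len: int, chunk_length: int = 64) -> int:
    """Sketch of the paper's recommendation: num_buckets ~ 2 * seq_len / chunk_length,
    rounded down to a power of 2. Illustrative only, not the HF implementation."""
    raw = 2 * seq_len // chunk_length
    return 2 ** max(raw.bit_length() - 1, 1)

# If the first batch is much shorter than the model's maximum sequence length,
# the derived bucket count differs noticeably:
first_batch_len = 2048   # hypothetical length of the first batch seen
max_seq_len = 65536      # hypothetical maximum sequence length of the model

print(suggested_num_buckets(first_batch_len))  # -> 64
print(suggested_num_buckets(max_seq_len))      # -> 2048
```

Under these assumptions, deriving num_buckets from the first batch (64) rather than from the maximum length (2048) leaves far fewer hash buckets than the formula intends for long sequences.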

Is there a reason for this slightly different implementation from the paper?

Please let me know if I'm missing something.