In the modeling_llama.py file, I can see that the rotary embeddings are applied (at line 271) after transposing the time and head dimensions of the keys and values (done at lines 267 and 268). In the official Meta implementation (llama/model.py in the meta-llama/llama GitHub repo), the rotary embeddings are applied before the transpositions. Why is this? I am asking because when I reimplement the model in PyTorch and load the weights, I get slightly different token distributions than when I load a Llama model through the transformers library.
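As far as I can tell, the ordering relative to the transpose should be mathematically equivalent, since RoPE only acts along the sequence and head-dim axes and the transpose just swaps the seq and head axes. Here is a minimal sketch I used to check this; `apply_rope` and the cos/sin cache construction are illustrative (following the rotate_half convention used in transformers), not the library's actual functions:

```python
import torch

def rotate_half(x):
    # Split the last dimension in half and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin):
    # Rotate positions along the last (head_dim) axis.
    return x * cos + rotate_half(x) * sin

batch, seq_len, n_heads, head_dim = 2, 5, 4, 8

# Illustrative cos/sin cache, shape (seq_len, head_dim).
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2) / head_dim))
t = torch.arange(seq_len, dtype=torch.float32)
freqs = torch.outer(t, inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)
cos, sin = emb.cos(), emb.sin()

q = torch.randn(batch, seq_len, n_heads, head_dim)

# Order A: transpose to (batch, heads, seq, head_dim) first, then apply RoPE.
out_a = apply_rope(q.transpose(1, 2), cos, sin)

# Order B: apply RoPE in (batch, seq, heads, head_dim) layout, then transpose.
# cos/sin gain a singleton heads axis so broadcasting lines up.
out_b = apply_rope(q, cos.unsqueeze(1), sin.unsqueeze(1)).transpose(1, 2)

print(torch.allclose(out_a, out_b))  # True: the two orderings agree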
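Since the orderings agree, I suspect the mismatch I see comes from a convention difference instead: if I recall correctly, the Meta code rotates interleaved even/odd pairs via complex multiplication, while transformers rotates the two halves of head_dim (rotate_half) and compensates by permuting the q/k projection weights in its checkpoint conversion script. Loading the raw Meta weights into a rotate_half-style implementation without that permutation would produce slightly different outputs.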
Some of the development staff also browse this forum, but I think it's better to open an issue on GitHub if you have questions about the library's implementation.