How to Modify the LLaMA 2 Model to Time Token Generation Internally

Hi all,
I am currently working with the LLaMA 2 model and need to measure, from inside the model’s computation loop, how long each token takes to generate. My goal is to put precise timing inside the model’s forward method so that I can analyze performance and computational cost per token.

I understand that this means modifying the model’s internals, specifically around the logits computation for each token. However, I am unsure of the best way to do this without affecting the model’s performance or changing its outputs.

Could anyone provide insights or examples on how to safely integrate timing into the forward pass of the LLaMA model? I am particularly interested in any best practices for modifying the forward method to include time measurements while ensuring the model remains stable and efficient.
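
To make the question concrete, here is a minimal sketch of the kind of thing I have in mind. It is not the real LLaMA code: `ToyModel`, `generate`, and `StepTimer` are hypothetical stand-ins I made up for illustration, and the optional `sync` callback is an assumption about where something like `torch.cuda.synchronize` would be passed on a GPU, since CUDA kernels launch asynchronously and the wall-clock time would otherwise only cover the launch.

```python
import time


class StepTimer:
    """Collects one wall-clock duration per call to the wrapped forward."""

    def __init__(self, sync=None):
        # Optional callable run before and after each call. On a GPU this
        # would presumably be torch.cuda.synchronize (assumption, not used here).
        self.sync = sync
        self.durations = []

    def wrap(self, forward):
        def timed_forward(*args, **kwargs):
            if self.sync is not None:
                self.sync()
            start = time.perf_counter()
            out = forward(*args, **kwargs)
            if self.sync is not None:
                self.sync()
            self.durations.append(time.perf_counter() - start)
            return out

        return timed_forward


class ToyModel:
    """Hypothetical stand-in for the real model: returns a fake next-token id."""

    def forward(self, tokens):
        return (sum(tokens) + 1) % 50


def generate(model, prompt, max_new_tokens):
    """Hypothetical generation loop: one forward call per new token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model.forward(tokens))
    return tokens


model = ToyModel()
timer = StepTimer()
# Replace forward on this instance only; the class itself is left untouched.
model.forward = timer.wrap(model.forward)

output = generate(model, [1, 2, 3], max_new_tokens=5)
for step, seconds in enumerate(timer.durations):
    print(f"token {step}: {seconds * 1e6:.1f} us")
```

Wrapping from the outside like this leaves the model's computation itself unchanged, but I do not know whether it is accurate enough, or whether the timing really needs to live inside the forward method.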

Any guidance or suggestions would be greatly appreciated!
Thank you!