Fine-tuned Mistral 7B inference issue for >4k context length tokens with transformers 4.35+

We fine-tuned mistralai/Mistral-7B-Instruct-v0.1 using LoRA on some 8k-context-length data. Inference was fine with transformers 4.34.0, but after updating the version, generation for prompts longer than 4096 tokens degrades into irrelevant repetition.
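For context, inference looks roughly like this (a minimal sketch, not our exact script; the adapter path, prompt, and generation settings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base model plus our LoRA adapter (the adapter path here is a placeholder)
base_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,  # Flash Attention 2 enabled, as in our original setup
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "path/to/our-lora-adapter")  # placeholder path

# Roughly 8k tokens of input in practice; on transformers 4.35+ the output
# degrades into repetition once the prompt exceeds 4096 tokens.
long_prompt = "..."  # placeholder for the actual long prompt
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```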

We were able to work around this by disabling Flash Attention 2, but overall model performance suffered. Has anyone run into a similar issue with longer context lengths? Any suggestions on what could solve this problem?
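For reference, the workaround was essentially to load the model without Flash Attention 2 (a sketch; the exact flag name depends on the transformers version):

```python
import torch
from transformers import AutoModelForCausalLM

# Workaround: load without Flash Attention 2 so the default (eager) attention is used.
# On 4.36+ this can be spelled explicitly as attn_implementation="eager";
# on 4.34/4.35 it amounts to simply not passing use_flash_attention_2=True.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
)
```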

There were two major version updates after 4.34.0:
4.35.x implemented the 4D attention mask during inference to accommodate changes from the Flash Attention team: Mistral: CUDA error when generating text with a batch of inputs · Issue #27908 · huggingface/transformers · GitHub

4.36.x (main branch) fixed an error caused by the 4D attention mask implementation for longer prompts: Fix mistral generate for long prompt / response by lorabit110 · Pull Request #27548 · huggingface/transformers · GitHub
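If that PR is indeed the relevant fix, then presumably the thing to check before re-enabling Flash Attention 2 for >4096-token prompts is whether the installed transformers contains it (4.36+ or a recent main build), e.g.:

```python
import transformers
from packaging import version

# Assuming the fix from PR #27548 ships in the 4.36 release, anything older
# (e.g. 4.35.x) would still hit the long-prompt 4D-mask issue with Flash Attention 2.
if version.parse(transformers.__version__) < version.parse("4.36.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} predates the long-prompt fix; "
        "upgrade to 4.36+ or install from the main branch."
    )
```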