Practicality and Efficiency of Non-Power-of-Two Context Lengths When Fine-Tuning Hugging Face Models (SFT)

I am considering increasing the context length of my model for fine-tuning, but I am curious about the practicality and efficiency of using non-power-of-two context lengths. Specifically, I would like to know:

  1. Efficiency Implications: How does using a non-power-of-two context length impact memory usage and computation efficiency? Are there specific optimizations that are lost or become less effective with non-power-of-two lengths?
  2. Model Performance: Have there been any observed impacts on model performance, such as generation quality or convergence speed, when using non-power-of-two context lengths?
  3. Best Practices: What are the recommended practices for increasing context length beyond typical values (e.g., 2048 or 4096)? Should increments be made gradually, and are there specific configurations or techniques (e.g., Sample Packing, Flash Attention) that can help manage potential inefficiencies? (A sketch of the kind of setup I have in mind follows this list.)

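For concreteness, here is a minimal sketch (not a definitive recipe) of the setup I am considering: a non-power-of-two truncation length combined with Flash Attention. The model name and the 3000-token length are arbitrary placeholders, and `attn_implementation="flash_attention_2"` assumes a recent transformers release with the flash-attn package installed.

```python
# Minimal sketch: non-power-of-two context length with Flash Attention.
# The model name and the 3000-token max_length are arbitrary examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder causal LM

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",                    # FA2 expects half precision
    attn_implementation="flash_attention_2",   # memory scales roughly linearly with length
)

def tokenize(batch):
    # Truncate to a non-power-of-two context length (3000 instead of 2048/4096).
    return tokenizer(batch["text"], truncation=True, max_length=3000)
```
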
I have found some references indicating that frameworks are optimized for powers of two and that non-power-of-two lengths might lead to inefficiencies and potential performance degradation. However, practical insights and experiences from the community would be incredibly valuable.
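
From what I have read so far, the hardware-level concern seems to be alignment of the padded sequence length to small multiples (e.g., 8 for fp16/bf16 tensor cores) rather than the overall context length being a power of two, and that can be handled at the collator level. A sketch of what I mean (the value 8 is just an example; the model name is a placeholder):

```python
# Sketch: pad batches to a multiple of 8 rather than to a power of two.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,             # causal LM objective
    pad_to_multiple_of=8,  # align padded length to a multiple of 8
)
```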

References:

  • Everything About Long Context Fine-tuning discusses memory and efficiency considerations in long-context fine-tuning.
  • Hugging Face Forums - Practicality of Context Length highlights potential issues with exceeding typical context lengths.
  • MosaicML Documentation on Fine-tuning provides a tutorial on fine-tuning Hugging Face models, including efficiency tips.

Also, I have seen batch sizes that are not powers of two, so are context lengths special in this regard? My conjecture is that they are not.