Practicality and Efficiency of Non-Power-of-Two Context Lengths When Fine-Tuning Hugging Face Models (SFT)

I am considering increasing the context length of my model for fine-tuning, but I am curious about the practicality and efficiency of using non-power-of-two context lengths. Specifically, I would like to know:

  1. Efficiency Implications: How does using a non-power-of-two context length impact memory usage and computation efficiency? Are there specific optimizations that are lost or become less effective with non-power-of-two lengths?
  2. Model Performance: Have there been any observed impacts on model performance, such as generation quality or convergence speed, when using non-power-of-two context lengths?
  3. Best Practices: What are the recommended practices for increasing context length beyond typical values (e.g., 2048 or 4096)? Should increments be made gradually, and are there specific configurations or techniques (e.g., Sample Packing, Flash Attention) that can help manage potential inefficiencies? (A sketch of the kind of setup I have in mind follows this list.)

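For concreteness, here is a minimal sketch (not a definitive recipe) of the setup I am considering: a non-power-of-two truncation length combined with Flash Attention. The model name and the 3000-token length are arbitrary placeholders, and `attn_implementation="flash_attention_2"` assumes a recent transformers release with the flash-attn package installed.

```python
# Minimal sketch: non-power-of-two context length with Flash Attention.
# The model name and the 3000-token max_length are arbitrary examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder causal LM

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",                    # FA2 expects half precision
    attn_implementation="flash_attention_2",   # memory scales roughly linearly with length
)

def tokenize(batch):
    # Truncate to a non-power-of-two context length (3000 instead of 2048/4096).
    return tokenizer(batch["text"], truncation=True, max_length=3000)
```
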
I have found some references indicating that frameworks are optimized for powers of two and that non-power-of-two lengths might lead to inefficiencies and potential performance degradation. However, practical insights and experiences from the community would be incredibly valuable.
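
From what I have read so far, the hardware-level concern seems to be alignment of the padded sequence length to small multiples (e.g., 8 for fp16/bf16 tensor cores) rather than the overall context length being a power of two, and that can be handled at the collator level. A sketch of what I mean (the value 8 is just an example; the model name is a placeholder):

```python
# Sketch: pad batches to a multiple of 8 rather than to a power of two.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,             # causal LM objective
    pad_to_multiple_of=8,  # align padded length to a multiple of 8
)
```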

References:

  • Everything About Long Context Fine-tuning discusses memory and efficiency considerations in long-context fine-tuning.
  • Hugging Face Forums - Practicality of Context Length highlights potential issues with exceeding typical context lengths.
  • MosaicML Documentation on Fine-tuning provides a tutorial on fine-tuning Hugging Face models, including efficiency tips.

Also, I have seen batch sizes that are not powers of two, so are context lengths special in this regard? My conjecture is that they are not.