There are quite a few models on the Hugging Face Hub that seem to require flash_attn, even though my understanding is that most of them can work fine without it. A few examples:
What is the best practice for getting them working on Apple M2/M3 laptops (ideally with Metal support)? Obviously flash_attn won’t be available, but there is still plenty of value in working with models locally on a laptop before they need the higher efficiency of flash_attn and CUDA.
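For context, here is a minimal sketch of the kind of local workflow I mean, assuming a model whose code does not hard-import flash_attn (the model id is a placeholder):

```python
# Minimal sketch: run a model locally on Apple Silicon via the "mps" backend,
# using a plain (non-flash) attention implementation. Assumes the model's code
# does not hard-import flash_attn; the model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("some-org/some-model")  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-model",            # placeholder
    torch_dtype=torch.float16,
    attn_implementation="eager",      # plain attention instead of flash_attn
).to(device)

inputs = tokenizer("Hello from an M2 laptop", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That path works for models that only *prefer* flash_attn; the problem is the ones whose remote code raises at import time when flash_attn is missing.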
I’ve found a few directional hints, but none of them have worked:
- In theory you should be able to monkey patch out the exception raised in transformers.dynamic_module_utils, but I cannot get that to work (see the sketch after this list).
- In theory you should be able to run `FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn==2.5.8`, but that fails to build (due to a strange issue with os.rename not working on macOS).
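For reference, the closest I’ve gotten with the monkey-patching idea is something like the sketch below. It assumes the model is loaded with `trust_remote_code=True` and that wrapping `transformers.dynamic_module_utils.get_imports` is enough to stop the loader from demanding flash_attn; the model id is a placeholder.

```python
# Sketch of the monkey-patching idea: wrap get_imports so flash_attn is dropped
# from the list of packages the dynamic-module loader checks for.
# Assumptions: the model is loaded with trust_remote_code=True, and its remote
# code tolerates attn_implementation="eager". The model id is a placeholder.
from unittest.mock import patch

import torch
from transformers import AutoModelForCausalLM
from transformers.dynamic_module_utils import get_imports


def get_imports_without_flash_attn(filename):
    """Call the stock get_imports and filter flash_attn out of the result."""
    return [imp for imp in get_imports(filename) if imp != "flash_attn"]


with patch("transformers.dynamic_module_utils.get_imports", get_imports_without_flash_attn):
    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-flash-attn-model",  # placeholder
        trust_remote_code=True,
        torch_dtype=torch.float16,
        attn_implementation="eager",       # ask for the plain attention path
    )
```

In practice this still falls over for some models whose custom code imports flash_attn directly at module level, which is why I’m asking whether there is a more general solution.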
Has anybody gotten these models working? Is there a general solution that Hugging Face could implement to allow these models to run / train (even if not very efficiently) on non-CUDA devices?