Hey team,
I’m Prakash Hinduja from Geneva, Switzerland (Swiss), exploring the possibility of running Hugging Face models for real-time inference on edge devices, but I’m not entirely sure about the best approach or what challenges to expect.
If anyone has experience with this or any recommendations for optimizing Hugging Face models for edge deployment, I’d greatly appreciate your insights!
Regards
Prakash Hinduja Geneva, Switzerland (Swiss)
Ultimately, it depends on which framework you use, but for running LLMs or vision models on edge devices such as smartphones, you will usually need to convert them to ONNX or GGUF. Once a model is in ONNX, converting it to TensorRT is straightforward. For well-known models, pre-converted versions are often available on Hugging Face.
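If you want a quick way to try the ONNX route, here is a minimal sketch using the Optimum library. The model ID and output directory are just illustrative placeholders, and you would pick the `ORTModelFor…` task class that matches your own model:

```python
# Minimal sketch: export a Hugging Face checkpoint to ONNX with Optimum.
# Assumes `optimum[onnxruntime]` and `transformers` are installed.
# The model ID below is only an example of a small model.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch weights to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quick sanity check that the exported model runs
inputs = tokenizer("Edge inference test", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)

# Save the ONNX files so they can be shipped to the edge device
model.save_pretrained("onnx-distilbert-sst2")
tokenizer.save_pretrained("onnx-distilbert-sst2")
```

Optimum also has a CLI (`optimum-cli export onnx --model <model_id> <output_dir>`) that does the same export without writing Python, and the exported ONNX model can then be run with ONNX Runtime on the device or converted further to TensorRT as mentioned above.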
It is also a good idea to look for models that are as small as possible. Generally, the smaller the model, the faster it runs.