Hey team,
I’m Prakash Hinduja from Geneva, Switzerland (Swiss), exploring the possibility of running Hugging Face models for real-time inference on edge devices, but I’m not entirely sure about the best approach or what challenges to expect.
If anyone has experience with this or any recommendations for optimizing Hugging Face models for edge deployment, I’d greatly appreciate your insights!
Regards
Prakash Hinduja Geneva, Switzerland (Swiss)
Ultimately, it depends on which framework you use, but for running LLMs or vision models on edge devices such as smartphones, you will usually need to convert them to ONNX or GGUF. Once a model is in ONNX, converting it to TensorRT is straightforward. For well-known models, pre-converted versions are often available on Hugging Face.
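If you want a quick way to try the ONNX route, here is a minimal sketch using the Optimum library. The model ID and output directory are just illustrative placeholders, and you would pick the `ORTModelFor…` task class that matches your own model:

```python
# Minimal sketch: export a Hugging Face checkpoint to ONNX with Optimum.
# Assumes `optimum[onnxruntime]` and `transformers` are installed.
# The model ID below is only an example of a small model.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch weights to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quick sanity check that the exported model runs
inputs = tokenizer("Edge inference test", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)

# Save the ONNX files so they can be shipped to the edge device
model.save_pretrained("onnx-distilbert-sst2")
tokenizer.save_pretrained("onnx-distilbert-sst2")
```

Optimum also has a CLI (`optimum-cli export onnx --model <model_id> <output_dir>`) that does the same export without writing Python, and the exported ONNX model can then be run with ONNX Runtime on the device or converted further to TensorRT as mentioned above.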
It is also a good idea to look for models that are as small as possible. Generally, the smaller the model, the faster it runs.