Good evening, I am currently working on fine-tuning Stable Diffusion 3.5 Large Turbo. I fine-tuned the model, obtained the LoRA weights, and put them into a repository. After that, I created an Inference Endpoint that loads the base model together with the adapter (LoRA weights), but inference is too slow: around 30 seconds on an L4 GPU and more than 10 seconds (around 15) on an A100. I also tried the fuse-LoRA-weights function and uploaded the entire model with the fused LoRA weights to a repository, but when I try to create an Inference Endpoint from it, the L4 deployment doesn't even come up (it says “failed”) and the A100 is once again slow. I'd like to speed up the inference time. Is there a way to do it? Is there also a more optimal way to upload the model and weights? Any recommendations are welcome.
Hi @manevamarija !
It sounds like you’re encountering some of the usual challenges with deploying Stable Diffusion models. Let’s break down the speed issues and upload problems:
- Why is inference slow? Diffusion models like Stable Diffusion 3.5 Large are computationally intensive. Applying a LoRA adds a small overhead, but the base model's size is the primary factor.
- Solutions for Speed (see the code sketches after this list):
  - Quantization: This is often the most effective way to speed up inference. It reduces the precision of the model's weights (e.g., from FP32 to FP16 or INT8), leading to a smaller model and faster computation. Look into `bitsandbytes` for 8-bit quantization or PyTorch's built-in quantization tools.
  - Optimization libraries (`xformers`, `torch.compile`): These libraries optimize the attention mechanisms, which are a major bottleneck in transformers. `xformers` is especially effective for SD, and `torch.compile` can also offer significant speedups with the right settings (e.g. `mode="reduce-overhead"`).
  - Different base model: SD 1.5 is significantly smaller and faster than later versions. If your requirements allow, consider fine-tuning on 1.5.
  - Batching: If you're generating multiple images at once, batching can significantly improve throughput.
  - Hardware efficiency: Monitor your GPU utilization. Low utilization suggests a bottleneck elsewhere; check your CPU usage, data loading times, and network bandwidth.
- Fused Model Upload Issues:
  - L4 “failed” upload: This is almost certainly insufficient VRAM on the L4. SD 3.5 Large plus its text encoders is a lot of weights, and loading the full fused checkpoint can exceed the L4's capacity. Quantization (or at least half precision) is essential here.
  - A100 slow inference with fused model: Double-check your inference code. Are you accidentally re-fusing the LoRA every time you run inference? The fusion should be done once, during model loading, and the resulting checkpoint then loaded directly.
- Optimal Upload Strategy:
  - Separate LoRA/base model: This is often more flexible. Use a library like `diffusers`, which provides efficient LoRA loading and application, and ensure the LoRA is loaded only once.
  - Fused model: Once fused, save the entire model as a single checkpoint. This is the most efficient for inference, as there's no runtime LoRA application (see the second sketch below).
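To make the quantization suggestion concrete, here is a minimal sketch (not from the thread) of loading the SD 3.5 diffusion transformer with `bitsandbytes` NF4 quantization through `diffusers`. It assumes a recent `diffusers` release with `bitsandbytes` and `accelerate` installed; the main benefit is fitting the model into less VRAM (e.g., on an L4) rather than a per-step speedup:

```python
import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline

model_id = "stabilityai/stable-diffusion-3.5-large-turbo"

# 4-bit NF4 quantization for the diffusion transformer only; text encoders and
# VAE stay in bf16. This mainly reduces VRAM so the model fits on smaller GPUs.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # stream components to the GPU as needed

image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=8,   # Turbo checkpoints are distilled for few steps
    guidance_scale=0.0,      # and for CFG-free sampling
).images[0]
image.save("astronaut.png")
```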
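And here is a separate sketch of the “fuse once, save once” workflow: load in half precision, fuse the LoRA a single time at startup, optionally compile, and push the fused pipeline to its own repo so the endpoint never applies the adapter at request time. The repo ids are placeholders, not names from the thread:

```python
import torch
from diffusers import StableDiffusion3Pipeline

base_id = "stabilityai/stable-diffusion-3.5-large-turbo"

# Half precision alone is a large speed/memory win over fp32.
pipe = StableDiffusion3Pipeline.from_pretrained(base_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load and fuse the LoRA exactly once, at startup, never inside the request handler.
pipe.load_lora_weights("your-username/sd35-turbo-lora")  # placeholder repo id
pipe.fuse_lora()
pipe.unload_lora_weights()  # the adapter tensors are no longer needed once fused

# Optional: compile the transformer for extra throughput (the first call is slow).
pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead")

image = pipe("a photo of an astronaut", num_inference_steps=8, guidance_scale=0.0).images[0]

# To serve the fused weights directly, push the whole pipeline to a repo once
# and point the Inference Endpoint at that repo:
# pipe.push_to_hub("your-username/sd35-turbo-fused")      # placeholder repo id
```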
I strongly recommend starting with quantization and `xformers`. They often provide the biggest gains with minimal code changes. If you can share snippets of your inference code, I might be able to offer more specific advice.
Hope this helps!
Hi Alan, thank you for your answer.
I'm currently using the Inference Endpoints platform to deploy my model, with everything set to the default configuration, but even with custom configurations there isn't much room to customize. I'll admit I don't completely understand how the platform works in detail. If you have any advice on how to optimize my setup or understand the platform better, I'd be more than happy to hear it. I chose it as my deployment platform because setting up an endpoint is very fast, and I'd like to continue using it, which is why I'm searching for a way to get faster inference.
Hi there,
Thanks for sharing more details about your setup! The Inference Endpoint platform is a great choice for quick deployments, but optimizing inference speed can depend on a few factors. Here are some tips to help you get better performance:
- Model Size and Optimization:
  - If you're using a large model, consider a distilled or quantized version, as these are typically smaller and faster while maintaining similar performance. Hugging Face offers tools like Optimum that can help optimize models for deployment.
- Batching Requests:
  - If your use case allows, process multiple requests in a single batch. Batching can significantly improve throughput and reduce latency on GPU-based inference (see the first sketch after this list).
- Hardware Selection:
  - If you're using the default hardware configuration, try upgrading to GPUs or higher-tier CPUs if your budget allows. GPUs, especially ones like the NVIDIA A100, can drastically improve inference speed for large models.
- Use Dynamic Quantization:
  - If you're working with PyTorch models, dynamic quantization is a quick way to reduce the size of the model and improve inference speed with minimal impact on accuracy (see the second sketch after this list).
- Pipeline Optimization:
  - Review your preprocessing and postprocessing steps. Inefficiencies here can contribute to delays. Tools like FastAPI or async programming can help speed things up if you're handling requests programmatically.
- Profile Your Model:
  - Use performance profilers (e.g., PyTorch Profiler, TensorBoard, NVIDIA Nsight) to identify where the bottlenecks are during inference.
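As a small illustration of the batching point above, here is a sketch (with made-up prompts) of passing a list of prompts to a `diffusers` pipeline so the whole batch is generated together:

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo", torch_dtype=torch.bfloat16
).to("cuda")

# A list of prompts is denoised together, one batch per step, which usually
# gives better GPU utilization than issuing four separate requests.
prompts = [
    "a watercolor fox",
    "a watercolor owl",
    "a watercolor bear",
    "a watercolor deer",
]
images = pipe(prompts, num_inference_steps=8, guidance_scale=0.0).images
for i, img in enumerate(images):
    img.save(f"batch_{i}.png")
```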
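And for the dynamic quantization bullet, a toy example of PyTorch's `quantize_dynamic`. Note that it targets `nn.Linear` (and LSTM) layers on CPU, so in a diffusion setup it would only help CPU-bound pieces of the pipeline, not the GPU transformer; the module below is a stand-in, not the real model:

```python
import torch
import torch.nn as nn

# Toy stand-in for a text-encoder-like module.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Weights of the selected layer types are converted to int8; activations are
# quantized dynamically at runtime. CPU-only.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)  # same interface as before, smaller weights
```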
If you’d like a more detailed walkthrough of one of these steps or have specific details about your use case, feel free to share, and I’d be happy to help!
Best,
Alan
> stable diffusion 3.5 large turbo
I have never used the Inference Endpoints API myself, so I don't know the actual specifications well, but with the Serverless Inference API the default number of steps is often around 30. This can produce good results with many models, but generation time grows in direct proportion to the number of steps. Some people also say that 8 steps are sufficient for SD 3.5 Turbo and Large Turbo, so why not try specifying the step count explicitly?
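For what it's worth, here is a hedged sketch of what passing the step count to an endpoint could look like, assuming the endpoint exposes the standard text-to-image task payload (`inputs` plus `parameters`); the URL and token are placeholders, and a custom handler may expect different parameter names:

```python
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
headers = {
    "Authorization": "Bearer hf_xxx",  # placeholder token
    "Content-Type": "application/json",
}

payload = {
    "inputs": "a photo of an astronaut riding a horse",
    "parameters": {
        "num_inference_steps": 8,  # Turbo checkpoints are distilled for few steps
        "guidance_scale": 0.0,
    },
}

resp = requests.post(ENDPOINT_URL, headers=headers, json=payload)
resp.raise_for_status()
with open("out.png", "wb") as f:
    f.write(resp.content)  # the text-to-image task returns raw image bytes
```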
Thank you Alan and John. I’ll try all of your suggestions and in case I have problems I’ll let you know.
Hello. It's still me. I haven't found a way to optimize my model. I tried quantization, but I think the problem is that it transforms the model into a format that is not supported by Hugging Face Inference Endpoints, since it says “failed” each time I try to deploy it. If you have more knowledge on this, could you guide me through it? Maybe I am doing it wrong. Another thing I wanted to ask is about the previous model I tried to deploy on an endpoint, Stable Diffusion v1.5: at the beginning it was very slow, but then it started running inference fast out of nowhere. Is it because the pipeline was cached, so each time I called the endpoint it wasn't wasting time reloading the pipeline all over again? This is the only logical explanation I can think of. And if so, why isn't the endpoint caching Stable Diffusion 3.5? Thank you in advance.