Floating point exception with nightly PyTorch and CUDA

Hi,

First of all, excuse me if this post is off-topic. I believe this issue is caused by Diffusers, but I'm not opening a GitHub issue because it's probably a misconfiguration rather than a bug. I'm new to Transformers and Diffusers and I'm running into problems with the nightly versions of PyTorch.

Specifically, nvidia-smi reports CUDA version 12.9 and NVIDIA driver version 575 (see below), and I installed the nightly PyTorch build matching this CUDA version using the selector on the website. I ran some test scripts and they confirm PyTorch is working fine (some complex math calculations on CUDA). However, when I try to run a vision model with Diffusers I get "Floating point exception" and nothing else, not even the usual traceback. Specifically, I tried the example code snippet for Stable Diffusion 3.5 Medium:

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=40,
    guidance_scale=4.5,
).images[0]
image.save("capybara.png")

I had no luck finding a solution to my problem, as the results I found were about code issues rather than library/driver issues. Here is some relevant information about my environment:

python
Python 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'12.9'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count
<function device_count at 0x7f60497056c0>
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name()
'NVIDIA GeForce RTX 5060 Ti'

I installed this PyTorch build with: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu129

pip list
Package                  Version
------------------------ ------------------------
bitsandbytes             0.46.1
certifi                  2025.7.14
charset-normalizer       3.4.2
diffusers                0.34.0
filelock                 3.18.0
fsspec                   2025.7.0
hf-xet                   1.1.5
huggingface-hub          0.33.4
idna                     3.10
importlib_metadata       8.7.0
Jinja2                   3.1.6
MarkupSafe               3.0.2
mpmath                   1.3.0
networkx                 3.5
numpy                    2.3.1
nvidia-cublas-cu12       12.9.1.4
nvidia-cuda-cupti-cu12   12.9.79
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-cudnn-cu12        9.10.2.21
nvidia-cufft-cu12        11.4.1.4
nvidia-cufile-cu12       1.14.1.1
nvidia-curand-cu12       10.3.10.19
nvidia-cusolver-cu12     11.7.5.82
nvidia-cusparse-cu12     12.5.10.65
nvidia-cusparselt-cu12   0.7.1
nvidia-nccl-cu12         2.27.5
nvidia-nvjitlink-cu12    12.9.86
nvidia-nvshmem-cu12      3.3.9
nvidia-nvtx-cu12         12.9.79
packaging                25.0
pillow                   11.2.1
pip                      23.0.1
pytorch-triton           3.4.0+gitae848267
PyYAML                   6.0.2
regex                    2024.11.6
requests                 2.32.4
safetensors              0.5.3
setuptools               66.1.1
sympy                    1.14.0
torch                    2.9.0.dev20250716+cu129
torchaudio               2.8.0.dev20250716+cu129
torchvision              0.24.0.dev20250716+cu129
tqdm                     4.67.1
triton                   3.3.1
typing_extensions        4.14.1
urllib3                  2.5.0
zipp                     3.23.0
nvidia-smi
Wed Jul 16 15:58:48 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   42C    P5              4W /  180W |      10MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Let me know if you need any additional information or if I can test something for you. FYI, Ollama runs fine in this setup. Thanks for your time.

1 Like

I think this is the cause on the Hopper architecture, but you are using Blackwell…

1 Like

Yes, indeed. I tried downgrading nvidia-cublas-cu12 to 12.4.5.8, but it didn't fix the issue, since that problem affected earlier versions of the library and, as you said, the Hopper arch. I'm not sure if I should open an issue with Diffusers on GitHub… Thanks.
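
For reference, the downgrade was just a pinned install on top of the nightly wheels (exact command from memory, so treat it as approximate):

# force the older cuBLAS wheel over the one pulled in by the nightly torch
pip install --force-reinstall nvidia-cublas-cu12==12.4.5.8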

1 Like

Yeah, I think the root cause is probably upstream (PyTorch or CUDA), but we can't move forward without knowing which part of the Diffusers SD 3.5 pipeline is triggering the problem, so it would be best to raise an issue with Diffusers. Reproduction is very simple for anyone with a 50x0 card… :sweat_smile:
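
One thing that might help narrow it down before filing: Python's built-in faulthandler module catches fatal signals such as SIGFPE and prints the Python-level traceback, so wrapping the repro in it could at least show which call inside the pipeline dies. A minimal sketch:

import faulthandler
faulthandler.enable()  # dumps a Python traceback on SIGSEGV/SIGFPE/SIGABRT/etc.

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

# A couple of steps is enough to hit the crash if it is in the denoising loop.
image = pipe("test prompt", num_inference_steps=2).images[0]

Running it with CUDA_LAUNCH_BLOCKING=1 can also make the reported location more precise, since kernel launches are asynchronous otherwise.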

1 Like

Done, the issue is here; I also gave you credit. Funnily enough, diffusers-cli env, the tool used to gather environment data for issue reports, also failed with the same error.

1 Like