I am running inference on a relatively small model (9B). However, after a few iterations it runs out of memory, despite having 32 GB of VRAM.
I have an RTX 5090 at home, and on Windows I do not run into this issue; usage never goes above 16 GB of VRAM.
I spun up a compute node online running Ubuntu. There the same job is much slower and continuously eats resources until it OOMs.
I am using torch.no_grad(), calling model.eval(), running gc.collect() and torch.cuda.empty_cache(), etc.
Nothing seems to work. There appear to be dangling references somewhere in the backend that I cannot resolve.
Minimal code:
```py
import gc
import random
from itertools import combinations

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

random.seed(3407)

generation_params = GenerationConfig(
    max_new_tokens=328,
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],
    do_sample=True,
)

model_id = "INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0"
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_default_system_prompt=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="cuda",
)

count = 0
for n in range(210):
    for a in range(119):
        for v in range(108):
            for f in range(110):
                featur = ""
                for ft in range(100):
                    messages = [
                        {"role": "user", "content": "Write a lengthy story about a frog and its friends meeting a stork. The story needs to be about 3 to 5 paragraphs. The frogs were originally afraid of the stork, but then grew to like him. Write a happy ending."},
                    ]
                    if random.random() > 0.01:
                        continue
                    count += 1
                    if count < 5012:
                        continue
                    print(messages)
                    with torch.no_grad():
                        input_ids = tokenizer.apply_chat_template(
                            messages,
                            return_tensors="pt",
                            add_generation_prompt=True,
                            return_dict=True,
                        ).to("cuda")
                        outputs = model.generate(
                            **input_ids,
                            generation_config=generation_params,
                        )
                    with open("data.json", "a", encoding="utf-8") as out:
                        out.write("{ \"prompt\": \"")
                        out.write(messages[0]["content"])
                        out.write("\",\n\"data\": \"")
                        out.write(tokenizer.decode(outputs[0]))
                        out.write("\"}\n")
                    gc.collect()
                    torch.cuda.empty_cache()
```
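For reference, a more aggressive version of the generation step would look roughly like the sketch below. The explicit del of the input/output tensors, the .cpu() copy before decoding, and torch.inference_mode() are illustrative additions on top of what I already do, not a confirmed fix:

```py
# Sketch only: generation step with explicit cleanup of GPU references.
with torch.inference_mode():
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        return_dict=True,
    ).to("cuda")
    outputs = model.generate(**inputs, generation_config=generation_params)

text = tokenizer.decode(outputs[0].cpu())  # keep only the decoded string on the host

del inputs, outputs           # drop the last references to the GPU tensors
gc.collect()
torch.cuda.empty_cache()      # release cached blocks back to the driver
```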
CUDA version: 12.8
Python: 3.11
PyTorch: 2.8.0
Transformers: latest
I found the opposite pattern, but it's rare that only Windows is okay. I'm a Windows user too…
https://stackoverflow.com/questions/78566798/oom-memory-increase-issue-in-model-training-with-pytorch-on-wsl2
GitHub issue, opened 17 Apr 2024, 11:31 UTC (labels: module: cuda, module: memory usage, triaged, module: wsl):
### 🐛 Describe the bug
# Issue
The following error occurs in one form or another whenever most PyTorch initialisation methods are called:
```py
File "/tmp/pip-build-env-rrigorsf/overlay/lib/python3.11/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
prop = get_device_properties(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-rrigorsf/overlay/lib/python3.11/site-packages/torch/cuda/__init__.py", line 453, in get_device_properties
_lazy_init() # will define _get_device_properties
^^^^^^^^^^^^
File "/tmp/pip-build-env-rrigorsf/overlay/lib/python3.11/site-packages/torch/cuda/__init__.py", line 302, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory
```
# Reproducing
The issue is easily reproduced by any of the following:
```py
Python 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.init()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 269, in init
_lazy_init()
File "/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 302, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory
>>> torch.cuda.is_available()
/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
```
# Attempted workarounds
I have tried installing in many environments over the last year, none with much success. As long as I am using WSL2, changing none of the following made any difference:
- Linux distributions (Ubuntu vs Debian etc)
- NVIDIA driver versions (543 vs 551 etc)
- Python versions (3.9.10 vs 3.10.12 etc)
- PyTorch versions (2.0.1 vs 2.1.2 etc)
- CUDA versions (11.8 vs 12.1 etc)
- CUDA toolkit versions (although I don't remember the previous versions I attempted)
Ultimately, I have decided to post an issue here, as I have yet to see an official fix for this issue at any point in time. It is the main problem preventing me from working with a multitude of other programs/frameworks.
# Known workarounds
The following modifications have been known to "fix" (or, more accurately, bypass) this issue, allowing `torch.cuda.init()` and related functions to be called without erroring; a rough combined sketch follows the list:
- Setting environment variable `PYTORCH_NVML_BASED_CUDA_CHECK=1` before running the script
- Update: This no longer appears to work. It appears either the error is occurring somewhere else, or the flag is no longer being used
- Calling `torch.cuda.device_count()` immediately after torch is imported in the script
- Directly modifying `torch/cuda/__init__.py` to include a call to `device_count()` immediately after the function is defined
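A minimal sketch of the two script-level workarounds together (assuming the environment variable is still honoured by your torch build):

```py
import os

# Workaround 1: ask PyTorch to use the NVML-based availability check.
# (Reported above as possibly no longer effective.)
os.environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1")

import torch

# Workaround 2: touch device_count() immediately after import,
# before anything else triggers CUDA lazy initialisation.
torch.cuda.device_count()
```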
Unfortunately, this is not a sustainable fix: it does not help when installing libraries that depend on specific versions of one another, because if they include torch, it is reinstalled from downloaded source/binaries into a separate temporary environment, which does not carry such modifications during the install.
### Versions
```py
python collect_env.py
Collecting environment information...
/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 4070
GPU 3: NVIDIA GeForce RTX 3090
GPU 4: NVIDIA GeForce RTX 3060
GPU 5: NVIDIA GeForce RTX 3060
Nvidia driver version: 552.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 7600 6-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
BogoMIPS: 7599.87
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 6 MiB (6 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] Could not collect
```
cc @ptrblck
@John6666 You are right, it is the Python version; I mistyped it. Edited now.
Since it's the 50x0 series, it's on the latest PyTorch…
So it doesn't seem to be a case of PyTorch being outdated.
However, since the model used is Gemma 2, it's unlikely to be a new bug in Transformers (although there was a significant change in behavior between 4.48.3 and 4.49.0…). There also didn't seem to be any similar OOM-related issues with Transformers.
While searching through PyTorch issues, I found that NCCL behavior can be a bit inconsistent depending on the version. However, I don't think there are any cases that match exactly.
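Just to rule out a version mismatch on the compute node, it might be worth printing exactly what is installed there (a trivial check, nothing model-specific):

```py
import torch
import transformers

# Confirm what the failing environment actually has installed.
print("torch        :", torch.__version__)
print("CUDA (torch) :", torch.version.cuda)
print("transformers :", transformers.__version__)
print("GPU          :", torch.cuda.get_device_name(0))
print("capability   :", torch.cuda.get_device_capability(0))
```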
GitHub issue, opened 28 Apr 2025, 04:54 UTC (labels: oncall: distributed, triaged, module: nccl, module: regression):
### 🐛 Describe the bug
After updating to PyTorch 2.7, using init_process_group with nccl and calling `DDP(model, device_ids=[rank])` results in an out of memory error. This makes absolutely no sense because it happens even when I am using extremely small amounts of memory, and DDP with nccl worked perfectly fine before the update on the same code.
Here is the error:
```
W0428 00:47:04.140000 51980 .venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 52051 via signal SIGTERM
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/.../.venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/home/.../example.py", line 39, in demo_basic
ddp_model = DDP(model, device_ids=[rank])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/.../.venv/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 835, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/.../.venv/lib/python3.12/site-packages/torch/distributed/utils.py", line 282, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'
```
The demo code on how to use DDP provided by PyTorch produces the same error:
```python
import os
import sys
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])  # HERE IS WHERE THE ERROR OCCURS

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()
    print(f"Finished running basic DDP example on rank {rank}.")


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)


if __name__ == "__main__":
    run_demo(demo_basic, 2)
```
### Versions
PyTorch version: 2.7.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39
Python version: 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 5090
GPU 2: NVIDIA GeForce RTX 4090
Nvidia driver version: 576.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 7980X 64-Cores
CPU family: 25
Model: 24
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Stepping: 1
BogoMIPS: 6390.51
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 64 MiB (64 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pytorch-lightning==2.5.1.post0
[pip3] pytorch_optimizer==3.5.1
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchmetrics==1.7.1
[pip3] torchvision==0.22.0+cu128
[pip3] triton==3.3.0
[conda] Could not collect
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k
GitHub issue, opened 5 Apr 2025, 18:56 UTC (labels: oncall: distributed, triaged, module: ddp, module: fsdp):
### 🐛 Describe the bug
Hi everyone,
I seem to have hit a roadblock and could use some help or clarification.
Environment:
* PyTorch Version: 2.8 (Is this correct? Please confirm the exact version)
* GPUs: 4 x NVIDIA 5090
* Parallelism Strategy Tried: DistributedDataParallel (DDP), FullyShardedDataParallel (FSDP), DeepSpeed
* Task: Training / Fine-tuning (Inference works fine)
* Other relevant environment details (Please add if possible):
* Operating System: [Ubuntu 22.04]
* CUDA Version: [12.8]
* NVIDIA Driver Version: [570]
* Python Version: [3.10]
Problem Description:
I am currently unable to successfully run training or fine-tuning jobs when using data parallelism on a system equipped with 4 NVIDIA 5090 GPUs and PyTorch 2.8. I have attempted to use standard DistributedDataParallel (DDP), FullyShardedDataParallel (FSDP), and also integrated DeepSpeed, but all attempts fail during the training/fine-tuning phase.
Interestingly, running inference tasks on the same multi-GPU setup works without issues. The problem appears specifically related to the training/fine-tuning process combined with data parallelism libraries.
Question:
Is there a known limitation or incompatibility with PyTorch 2.8 (or the associated libraries like DDP, FSDP, DeepSpeed) that prevents data parallel training/fine-tuning on a 4x NVIDIA 5090 configuration? Or could there be other configuration issues I might be overlooking?
Any insights, confirmation of compatibility, or suggestions for troubleshooting would be greatly appreciated. If specific error messages or a minimal reproducible code example would be helpful, please let me know, and I can try to provide them.
Thanks for your help
### Versions
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360
GitHub issue, opened 29 Mar 2025, 06:50 UTC (labels: needs reproduction, module: binaries, module: cuda, triaged):
I am installing the PyTorch GPU version on an RTX 5090 device, but I am getting an error:

here is my torch version:
Name: torch
Version: 2.8.0.dev20250327+cu128
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /home/air/anaconda3/envs/kohya/lib/python3.12/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-cufile-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, pytorch-triton, setuptools, sympy, typing-extensions
Required-by: torchaudio, torchvision
here is my os info:

cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim @eqy
Have you been able to reproduce it? Or is it just that GPU being faulty or something?
Um… I don't have a 5090, so I can't reproduce it…
The most suspect is the PyTorch version, followed by the Transformers version, then the CUDA Toolkit version, and then possibly a GPU failure. The latest versions are always full of bugs…
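If you want to narrow it down further, it may help to log the allocator statistics once per generation and see whether it is the reserved cache or the live allocations that keep growing (a rough sketch; the tag and print format are arbitrary):

```py
import torch

def log_cuda_memory(tag: str = "") -> None:
    # allocated = live tensors, reserved = allocator cache, peak = high-water mark
    gib = 1024 ** 3
    print(
        f"[{tag}] "
        f"allocated={torch.cuda.memory_allocated() / gib:.2f} GiB "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB "
        f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB"
    )

# e.g. call log_cuda_memory(f"iter {count}") right after model.generate(...),
# and torch.cuda.reset_peak_memory_stats() at the top of each iteration
# if you want per-iteration peaks.
```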