I am running inference on a relatively small model (9B). However, after a few iterations it runs out of memory, despite having 32 GB of VRAM.
I have an RTX 5090 at home, and on Windows I do not run into this issue; usage never goes above 16 GB of VRAM.
I spun up a compute node online running Ubuntu. There the same job is much slower and continuously eats resources until it OOMs.
I am using torch.no_grad(), calling model.eval(), running gc.collect() and torch.cuda.empty_cache(), etc.
Nothing seems to work. There appear to be dangling references somewhere in the backend that I cannot resolve.
Minimal code:
```py
import gc
import random
from itertools import combinations

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

random.seed(3407)

generation_params = GenerationConfig(
    max_new_tokens=328,
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],
    do_sample=True,
)

model_id = "INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0"
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_default_system_prompt=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="cuda",
)

count = 0
for n in range(210):
    for a in range(119):
        for v in range(108):
            for f in range(110):
                featur = ""
                for ft in range(100):
                    messages = [
                        {"role": "user", "content": "Write a lengthy story about a frog and its friends meeting a stork. The story needs to be about 3 to 5 paragraphs. The frogs were originally afraid of the stork, but then grew to like him. Write a happy ending."},
                    ]
                    if random.random() > 0.01:
                        continue
                    count += 1
                    if count < 5012:
                        continue
                    print(messages)
                    with torch.no_grad():
                        input_ids = tokenizer.apply_chat_template(
                            messages,
                            return_tensors="pt",
                            add_generation_prompt=True,
                            return_dict=True,
                        ).to("cuda")
                        outputs = model.generate(
                            **input_ids,
                            generation_config=generation_params,
                        )
                    with open("data.json", "a", encoding="utf-8") as out:
                        out.write("{ \"prompt\": \"")
                        out.write(messages[0]["content"])
                        out.write("\",\n\"data\": \"")
                        out.write(tokenizer.decode(outputs[0]))
                        out.write("\"}\n")
                    gc.collect()
                    torch.cuda.empty_cache()
```
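For reference, a more aggressive version of the generation step would look roughly like the sketch below. The explicit del of the input/output tensors, the .cpu() copy before decoding, and torch.inference_mode() are illustrative additions on top of what I already do, not a confirmed fix:

```py
# Sketch only: generation step with explicit cleanup of GPU references.
with torch.inference_mode():
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        return_dict=True,
    ).to("cuda")
    outputs = model.generate(**inputs, generation_config=generation_params)

text = tokenizer.decode(outputs[0].cpu())  # keep only the decoded string on the host

del inputs, outputs           # drop the last references to the GPU tensors
gc.collect()
torch.cuda.empty_cache()      # release cached blocks back to the driver
```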
CUDA version: 12.8
Python: 3.11
PyTorch: 2.8.0
Transformers: latest
I found the opposite pattern, but it's rare that only Windows is okay. I'm a Windows user too…
https://stackoverflow.com/questions/78566798/oom-memory-increase-issue-in-model-training-with-pytorch-on-wsl2
GitHub issue, opened 17 Apr 2024, 11:31 UTC (labels: module: cuda, module: memory usage, triaged, module: wsl):
### 🐛 Describe the bug
# Issue
The following error occurs in one form or another whenever most PyTorch initialisation methods are called:
```py
File "/tmp/pip-build-env-rrigorsf/overlay/lib/python3.11/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
prop = get_device_properties(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/pip-build-env-rrigorsf/overlay/lib/python3.11/site-packages/torch/cuda/__init__.py", line 453, in get_device_properties
_lazy_init() # will define _get_device_properties
^^^^^^^^^^^^
File "/tmp/pip-build-env-rrigorsf/overlay/lib/python3.11/site-packages/torch/cuda/__init__.py", line 302, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory
```
# Reproducing
The issue is easily reproduced by any of the following:
```py
Python 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.init()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 269, in init
_lazy_init()
File "/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 302, in _lazy_init
torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory
>>> torch.cuda.is_available()
/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
```
# Attempted workarounds
I have tried installing in many environments over the last year, none with much success. As long as I am using WSL2, changing none of the following made any difference:
- Linux distributions (Ubuntu vs Debian etc)
- NVIDIA driver versions (543 vs 551 etc)
- Python versions (3.9.10 vs 3.10.12 etc)
- PyTorch versions (2.0.1 vs 2.1.2 etc)
- CUDA versions (11.8 vs 12.1 etc)
- CUDA toolkit versions (although I don't remember the previous versions I attempted)
Ultimately, I have decided to post an issue here, as I have yet to see an official fix for this issue at any point in time. It is the main problem preventing me from working with a multitude of other programs/frameworks.
# Known workarounds
The following modifications have been known to "fix" (or, more accurately, bypass) this issue, allowing `torch.cuda.init()` and related functions to be called without erroring; a rough combined sketch follows the list:
- Setting environment variable `PYTORCH_NVML_BASED_CUDA_CHECK=1` before running the script
- Update: This no longer appears to work. It appears either the error is occurring somewhere else, or the flag is no longer being used
- Calling `torch.cuda.device_count()` immediately after torch is imported in the script
- Directly modifying `torch/cuda/__init__.py` to include a call to `device_count()` immediately after the function is defined
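A minimal sketch of the two script-level workarounds together (assuming the environment variable is still honoured by your torch build):

```py
import os

# Workaround 1: ask PyTorch to use the NVML-based availability check.
# (Reported above as possibly no longer effective.)
os.environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1")

import torch

# Workaround 2: touch device_count() immediately after import,
# before anything else triggers CUDA lazy initialisation.
torch.cuda.device_count()
```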
Unfortunately, this is not a sustainable fix: it does not help when installing libraries that depend on specific versions of one another, because if they include torch, it is reinstalled from downloaded source/binaries into a separate temporary environment, which does not carry such modifications during the install.
### Versions
```py
python collect_env.py
Collecting environment information...
/home/txin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 4070
GPU 3: NVIDIA GeForce RTX 3090
GPU 4: NVIDIA GeForce RTX 3060
GPU 5: NVIDIA GeForce RTX 3060
Nvidia driver version: 552.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 7600 6-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
BogoMIPS: 7599.87
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 6 MiB (6 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] Could not collect
```
cc @ptrblck
@John6666 You are right, it is the Python version; I mistyped it. Edited now.
Since it's the 50x0 series, it's on the latest PyTorch…
So it doesn't seem to be a case of PyTorch being outdated.
However, since the model used is Gemma 2, it's unlikely to be a new bug in Transformers (although there was a significant change in behavior between 4.48.3 and 4.49.0…). There also didn't seem to be any similar OOM-related issues with Transformers.
While searching through PyTorch issues, I found that NCCL behavior can be a bit inconsistent depending on the version. However, I don't think there are any cases that match exactly.
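Just to rule out a version mismatch on the compute node, it might be worth printing exactly what is installed there (a trivial check, nothing model-specific):

```py
import torch
import transformers

# Confirm what the failing environment actually has installed.
print("torch        :", torch.__version__)
print("CUDA (torch) :", torch.version.cuda)
print("transformers :", transformers.__version__)
print("GPU          :", torch.cuda.get_device_name(0))
print("capability   :", torch.cuda.get_device_capability(0))
```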
GitHub issue, opened 28 Apr 2025, 04:54 UTC (labels: oncall: distributed, triaged, module: nccl, module: regression):
### 🐛 Describe the bug
After updating to PyTorch 2.7, using init_process_group with nccl and calling `DDP(model, device_ids=[rank])` results in an out of memory error. This makes absolutely no sense because it happens even when I am using extremely small amounts of memory, and DDP with nccl worked perfectly fine before the update on the same code.
Here is the error:
```
W0428 00:47:04.140000 51980 .venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 52051 via signal SIGTERM
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/.../.venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/home/.../example.py", line 39, in demo_basic
ddp_model = DDP(model, device_ids=[rank])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/.../.venv/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 835, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/.../.venv/lib/python3.12/site-packages/torch/distributed/utils.py", line 282, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'
```
The demo code on how to use DDP provided by PyTorch produces the same error:
```python
import os
import sys
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])  # HERE IS WHERE THE ERROR OCCURS

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()
    print(f"Finished running basic DDP example on rank {rank}.")


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)


if __name__ == "__main__":
    run_demo(demo_basic, 2)
```
### Versions
PyTorch version: 2.7.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39
Python version: 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 5090
GPU 2: NVIDIA GeForce RTX 4090
Nvidia driver version: 576.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 7980X 64-Cores
CPU family: 25
Model: 24
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Stepping: 1
BogoMIPS: 6390.51
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 64 MiB (64 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pytorch-lightning==2.5.1.post0
[pip3] pytorch_optimizer==3.5.1
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchmetrics==1.7.1
[pip3] torchvision==0.22.0+cu128
[pip3] triton==3.3.0
[conda] Could not collect
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k
GitHub issue, opened 5 Apr 2025, 18:56 UTC (labels: oncall: distributed, triaged, module: ddp, module: fsdp):
### 🐛 Describe the bug
Hi everyone,
I seem to have hit a roadblock and could use some help or clarification.
Environment:
* PyTorch Version: 2.8 (Is this correct? Please confirm the exact version)
* GPUs: 4 x NVIDIA 5090
* Parallelism Strategy Tried: DistributedDataParallel (DDP), FullyShardedDataParallel (FSDP), DeepSpeed
* Task: Training / Fine-tuning (Inference works fine)
* Other relevant environment details (Please add if possible):
* Operating System: [Ubuntu 22.04]
* CUDA Version: [12.8]
* NVIDIA Driver Version: [570]
* Python Version: [3.10]
Problem Description:
I am currently unable to successfully run training or fine-tuning jobs when using data parallelism on a system equipped with 4 NVIDIA 5090 GPUs and PyTorch 2.8. I have attempted to use standard DistributedDataParallel (DDP), FullyShardedDataParallel (FSDP), and also integrated DeepSpeed, but all attempts fail during the training/fine-tuning phase.
Interestingly, running inference tasks on the same multi-GPU setup works without issues. The problem appears specifically related to the training/fine-tuning process combined with data parallelism libraries.
Question:
Is there a known limitation or incompatibility with PyTorch 2.8 (or the associated libraries like DDP, FSDP, DeepSpeed) that prevents data parallel training/fine-tuning on a 4x NVIDIA 5090 configuration? Or could there be other configuration issues I might be overlooking?
Any insights, confirmation of compatibility, or suggestions for troubleshooting would be greatly appreciated. If specific error messages or a minimal reproducible code example would be helpful, please let me know, and I can try to provide them.
Thanks for your help
### Versions
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360
GitHub issue, opened 29 Mar 2025, 06:50 UTC (labels: needs reproduction, module: binaries, module: cuda, triaged):
I am installing the PyTorch GPU version on an RTX 5090 device, but I am getting an error:

here is my torch version:
Name: torch
Version: 2.8.0.dev20250327+cu128
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /home/air/anaconda3/envs/kohya/lib/python3.12/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-cufile-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-cusparselt-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, pytorch-triton, setuptools, sympy, typing-extensions
Required-by: torchaudio, torchvision
here is my os info:

cc @seemethere @malfet @osalpekar @atalman @ptrblck @msaroufim @eqy
Have you been able to reproduce it? Or is it just that GPU being faulty or something?
Um… I don't have a 5090, so I can't reproduce it…
The most suspect is the PyTorch version, followed by the Transformers version, then the CUDA Toolkit version, and then possibly a GPU failure. The latest versions are always full of bugs…
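If you want to narrow it down further, it may help to log the allocator statistics once per generation and see whether it is the reserved cache or the live allocations that keep growing (a rough sketch; the tag and print format are arbitrary):

```py
import torch

def log_cuda_memory(tag: str = "") -> None:
    # allocated = live tensors, reserved = allocator cache, peak = high-water mark
    gib = 1024 ** 3
    print(
        f"[{tag}] "
        f"allocated={torch.cuda.memory_allocated() / gib:.2f} GiB "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB "
        f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB"
    )

# e.g. call log_cuda_memory(f"iter {count}") right after model.generate(...),
# and torch.cuda.reset_peak_memory_stats() at the top of each iteration
# if you want per-iteration peaks.
```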