RTX 4090 Huggingface Trainer Compatible?

Hi, I’m trying to train a Hugging Face model using PyTorch with an NVIDIA RTX 4090.

The training worked well previously on an RTX 3090.

Currently I am finding that INFERENCE works well on the 4090, but training hangs at 0% progress.

I am training inside this docker container: nvcr.io/nvidia/pytorch:22.09-py3
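For context, a typical way to start that container looks roughly like this (a sketch, not my exact command; the mount path is just an example, and the `--gpus` flag assumes the NVIDIA Container Toolkit is installed):

```shell
# Sketch: start the NGC PyTorch container with all GPUs visible.
# Requires the NVIDIA Container Toolkit for the --gpus flag.
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/project \
    nvcr.io/nvidia/pytorch:22.09-py3
```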

Here is the output of nvidia-smi … it appears to be identical both inside and outside the Docker container.

| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |

My guess is that the problem is with PyTorch: maybe it does not yet support driver 520, which the RTX 4090 requires?


Actually, when my container boots, I get this warning:

WARNING: Detected NVIDIA NVIDIA GeForce RTX 4090 GPU, which is not yet supported in this version of the container

So that’s pretty clear. Has anyone been able to do training inside a docker container with the RTX 4090?


Just tried with the newest NVIDIA PyTorch container, 22.10-py3 – released 10/28/2022 – same behavior: it hangs with 100% CPU (using just one core out of 48 in a Ryzen Threadripper) … and 0% training progress.

Just tried it without docker, exact same results. My current guess is that:

  1. The RTX 4090 requires NVIDIA driver 520, which uses CUDA 11.8
  2. PyTorch is not yet compatible with CUDA 11.8

Again, inference works fine with Hugging Face tools. Training hangs with 100% CPU and no progress.

If anyone finds otherwise, please let me know – or if you would like me to run a certain test case on the RTX 4090, I’d be happy to do so.


Yes, that always happens with new hardware; usually the solution is to download a new container from NVIDIA.

Check NVIDIA NGC and look for the latest version of the PyTorch container, then run your code inside that container and that should do it.

Thanks, and if you have any questions, please let us know.

1 Like

To follow up – I have TWO RTX 4090s in this system, and that seems to be related to the problem. I ran the Docker container again with CUDA_VISIBLE_DEVICES=0 and NVIDIA_VISIBLE_DEVICES=0 as environment variables, and training worked well on one GPU.
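For anyone who prefers setting this in code rather than via docker `-e` flags, a minimal sketch of restricting the process to GPU 0 from Python. Note (my understanding, worth verifying): NVIDIA_VISIBLE_DEVICES is read by the NVIDIA container runtime at container start, so from inside an already-running container only CUDA_VISIBLE_DEVICES matters, and it must be set before CUDA is initialized:

```python
import os

# Expose only the first GPU to CUDA. This must run before
# `import torch` (or any other CUDA initialization) takes place.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# From here on, the process sees a single GPU (device index 0).
```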

1 Like

We are facing the exact same issue: DDP training hangs with 100% CPU and no progress when using multiple 4090s. Torch gets stuck here (using NVIDIA PyTorch 22.11-py3):

  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 109, in join
    ready = multiprocessing.connection.wait(
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

@the-pale-king have you been able to find a workaround?

1 Like

Yes, this is the fix that works for me when running in the Docker container… but note, it only allows using one GPU at a time!

I ran the Docker container again with
NVIDIA_VISIBLE_DEVICES=0 as an environment variable, and training worked well on one GPU.

And also – my training process occasionally seems to deadlock, and I have to restart it from scratch. I never experienced this on the RTX 3090. I’m guessing the RTX 4090 drivers are not mature yet.

1 Like

Setting NCCL_P2P_DISABLE=1 is all you need :wink:
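To make that concrete, a sketch of applying the flag from Python. It must be set before torch creates any NCCL process group; with a launcher like torchrun you would instead export it in the shell. The comment reflects my understanding of why it helps, not official documentation:

```python
import os

# Tell NCCL not to use direct GPU peer-to-peer copies; inter-GPU
# traffic is then staged through host (CPU) memory instead. The RTX
# 4090 reportedly lacks P2P support, so without this flag NCCL's
# P2P path can hang. Must be set before any NCCL communicator is
# initialized (i.e., before torch.distributed.init_process_group).
os.environ["NCCL_P2P_DISABLE"] = "1"
```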

Hi @nicolaspanel
Does setting NCCL_P2P_DISABLE=1 really work?
NCCL_P2P_DISABLE disables direct peer-to-peer communication between GPUs – why do we need that here?