RTX 4090 Huggingface Trainer Compatible?

Hi, I’m trying to train a Hugging Face model using PyTorch with an NVIDIA RTX 4090.

The training worked well previously on an RTX 3090.

Currently I am finding that INFERENCE works well on the 4090, but training hangs at 0% progress.

I am training inside this docker container: nvcr.io/nvidia/pytorch:22.09-py3
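For context, a typical way to start that container looks roughly like this (a sketch, not my exact command; the mount path is just an example, and the `--gpus` flag assumes the NVIDIA Container Toolkit is installed):

```shell
# Sketch: start the NGC PyTorch container with all GPUs visible.
# Requires the NVIDIA Container Toolkit for the --gpus flag.
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/project \
    nvcr.io/nvidia/pytorch:22.09-py3
```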

Here is the output of nvidia-smi … it appears to be identical both inside and outside the Docker container.

| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |

My guess is that the problem is with PyTorch: maybe it does not yet support driver 520, which the RTX 4090 requires?


Actually, when my container boots, I get this warning:

WARNING: Detected NVIDIA NVIDIA GeForce RTX 4090 GPU, which is not yet supported in this version of the container

So that’s pretty clear. Has anyone been able to do training inside a docker container with the RTX 4090?


Just tried with the newest NVIDIA PyTorch container, 22.10-py3 – released 10/28/2022 – same behavior: it hangs with 100% CPU (using just one core out of 48 in a Ryzen Threadripper) … and 0% training progress.

Just tried it without docker, exact same results. My current guess is that:

  1. The RTX 4090 requires NVIDIA driver 520, which uses CUDA 11.8
  2. PyTorch is not yet compatible with CUDA 11.8

Again, inference works fine with Hugging Face tools. Training hangs with 100% CPU and no progress.

If anyone finds otherwise, please let me know – or if you would like me to run a certain test case on the RTX 4090, I’d be happy to do so.


Yes, that always happens with new hardware; usually the solution is to download a new container from NVIDIA.

Check NVIDIA NGC and look for the latest version of the PyTorch container, then run your code inside that container and that should do it.

Thanks, and if you have any questions, please let us know.

1 Like

To follow up – I have TWO RTX 4090s in this system, and that seems to be related to the problem. I ran the Docker container again with CUDA_VISIBLE_DEVICES=0 and NVIDIA_VISIBLE_DEVICES=0 as environment variables, and training worked well on one GPU.
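For anyone who prefers setting this in code rather than via docker `-e` flags, a minimal sketch of restricting the process to GPU 0 from Python. Note (my understanding, worth verifying): NVIDIA_VISIBLE_DEVICES is read by the NVIDIA container runtime at container start, so from inside an already-running container only CUDA_VISIBLE_DEVICES matters, and it must be set before CUDA is initialized:

```python
import os

# Expose only the first GPU to CUDA. This must run before
# `import torch` (or any other CUDA initialization) takes place.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# From here on, the process sees a single GPU (device index 0).
```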

1 Like

We are facing the exact same issue: DDP training hangs with 100% CPU and no progress when using multiple 4090s. Torch gets stuck here (using NVIDIA PyTorch 22.11-py3):

  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 109, in join
    ready = multiprocessing.connection.wait(
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

@the-pale-king have you been able to find a workaround?

1 Like

Yes, this is the fix that works for me when running in the Docker container… but note, it only allows using one GPU at a time!

I ran the Docker container again with
NVIDIA_VISIBLE_DEVICES=0 as an environment variable, and training worked well on one GPU.

And also – my training process occasionally seems to deadlock, and I have to restart it from scratch. I never experienced this on the RTX 3090. I’m guessing the RTX 4090 drivers are not mature yet.

1 Like

Setting NCCL_P2P_DISABLE=1 is all you need :wink:
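To make that concrete, a sketch of applying the flag from Python. It must be set before torch creates any NCCL process group; with a launcher like torchrun you would instead export it in the shell. The comment reflects my understanding of why it helps, not official documentation:

```python
import os

# Tell NCCL not to use direct GPU peer-to-peer copies; inter-GPU
# traffic is then staged through host (CPU) memory instead. The RTX
# 4090 reportedly lacks P2P support, so without this flag NCCL's
# P2P path can hang. Must be set before any NCCL communicator is
# initialized (i.e., before torch.distributed.init_process_group).
os.environ["NCCL_P2P_DISABLE"] = "1"
```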

Hi @nicolaspanel
Does setting NCCL_P2P_DISABLE=1 really work?
NCCL_P2P_DISABLE disables direct peer-to-peer communication between GPUs – why do we need that here?