RTX 4090 Huggingface Trainer Compatible?

Hi, I’m trying to train a Huggingface model using PyTorch with an NVIDIA RTX 4090.

The training worked well previously on an RTX 3090.

Currently I am finding that INFERENCE works well on the 4090, but training hangs at 0% progress.

I am training inside this docker container: nvcr.io/nvidia/pytorch:22.09-py3

Here is the output of nvidia-smi; this output seems to be identical both inside and outside of the docker container.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+

My guess is that the problem is with PyTorch; maybe it has not yet implemented support for driver 520, which the RTX 4090 requires?
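
For reference, here is a quick check I can run inside the container to see what the installed PyTorch build actually supports (just a diagnostic sketch):

# Rough check of whether this PyTorch build knows about the 4090's architecture (sm_89)
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # expect (8, 9) on a 4090
print("architectures in this build:", torch.cuda.get_arch_list())  # look for 'sm_89'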

thanks

Actually, when my container boots, I get this warning:

WARNING: Detected NVIDIA NVIDIA GeForce RTX 4090 GPU, which is not yet supported in this version of the container

So that’s pretty clear. Has anyone been able to do training inside a docker container with the RTX 4090?

thanks

Just tried with the newest NVIDIA PyTorch container, 22.10-py3 (released 10/28/2022): same behavior. Training hangs at 0% progress with 100% usage on a single CPU core (out of 48 on a Ryzen Threadripper).

Just tried it without docker; exact same results. My current guess is that:

  1. The RTX 4090 requires NVIDIA driver 520, which uses CUDA 11.8
  2. PyTorch is not yet compatible with CUDA 11.8

Again, inference works fine with Huggingface tools. Training hangs with 100% CPU and no progress.

If anyone finds otherwise, please let me know, or if you would like me to run a particular test case on the RTX 4090, I’d be happy to do so.
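
In case it helps, here is the sort of minimal test case I mean (the model and dataset names below are just examples, not my actual job):

# Hypothetical minimal repro: a tiny fine-tune that should start moving the
# progress bar within a minute or two on a working GPU.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # any small model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Small slice of a public dataset, just enough to exercise the training loop
dataset = load_dataset("imdb", split="train[:200]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="/tmp/rtx4090-test",
    per_device_train_batch_size=8,
    max_steps=20,      # enough to see whether the progress bar ever moves past 0%
    logging_steps=5,
)

Trainer(model=model, args=args, train_dataset=dataset).train()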

Hi,

Yes, that always happens with new hardware; usually the solution is to download a newer container from NVIDIA.

Check NVIDIA NGC for the latest version of the PyTorch container, then run your code inside that container; that should do it.

Thanks, and if you have any questions, please let us know.

To follow up: I have TWO RTX 4090s in this system, and that seems to be related to the problem. I ran the docker container again with CUDA_VISIBLE_DEVICES: 0 and NVIDIA_VISIBLE_DEVICES: 0 as environment variables, and training worked well on the single visible GPU.
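
For anyone launching the script directly rather than through docker, the same restriction can be approximated in Python (rough sketch; the variable has to be set before torch is imported):

# Hide the second 4090 from PyTorch so the Trainer only ever sees one GPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should now report 1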
