RuntimeError: Found no NVIDIA driver on your system when running on NVIDIA A10G Large

Hi Guys,

I created a Space with NVIDIA A10G hardware and the blank Docker template, then pushed my Dockerfile along with a script to fine-tune my private model. When it runs, I get this error:

/home/admin/.local/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/admin/.local/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Is CUDA available: False
Traceback (most recent call last):
  File "/app/train.py", line 20, in <module>
    print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 674, in current_device
    _lazy_init()
  File "/home/admin/.local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

I added some commands to my Dockerfile and checked the build logs, which show no NVIDIA GPU present on my Space. Here are the logs:

--> RUN lspci -vnn | egrep 'VGA|3D'
lspci: Unable to load libkmod resources: error -2
00:01.3 Non-VGA unclassified device [0000]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 08)
00:03.0 VGA compatible controller [0300]: Amazon.com, Inc. Device [1d0f:1111] (prog-if 00 [VGA controller])
DONE 0.0s

Has anybody else faced this issue?

Hi @safihaider, could you please share an example of your Dockerfile?
Here’s an example of a Dockerfile that is compatible with our GPU hardware. In general, using a docker image like FROM nvidia/cuda:12.0.0-cudnn8-devel-ubuntu22.04 is greatly advised.

Hi @radames ,

Yes, I tried different GPU base images from nvidia/cuda, including the one you mentioned, and the error still occurs. I checked the CUDA version and it's available. This is the content of my Dockerfile:

FROM nvidia/cuda:12.2.0-runtime-ubuntu20.04
FROM nvcr.io/nvidia/pytorch:22.08-py3
RUN nvcc -V
RUN nvidia-smi

These are the build logs on my space:

===== Build Queued at 2023-09-03 08:19:07 / Commit SHA: 298bbcd

===== --> FROM nvcr.io/nvidia/pytorch:22.08-py3@sha256:1aa83e1a13f756f31dabf82bc5a3c4f30ba423847cb230ce8c515f3add88b262 

DONE 0.0s 

DONE 26.3s 

DONE 27.2s 

DONE 57.8s 

DONE 59.3s 

DONE 74.1s 

DONE 75.0s 

DONE 75.1s 

--> RUN nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2022 NVIDIA Corporation Built on Wed_Jun__8_16:49:14_PDT_2022 Cuda compilation tools, release 11.7, V11.7.99 Build cuda_11.7.r11.7/compiler.31442593_0

DONE 0.2s

--> RUN nvidia-smi
/bin/bash: nvidia-smi: command not found

--> ERROR: process "/bin/sh -c nvidia-smi" did not complete successfully: exit code: 127

As you can see CUDA is present but the nvidia-smi command could not be found.

Hi @safihaider, I don’t think you can run nvidia-smi at Docker build time, since the machine that builds the Docker image doesn’t have a GPU. Once the image is deployed on the GPU hardware, you can run these commands.
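
To illustrate the build-time versus run-time point, here is a minimal sketch of a check that could be called at the start of train.py instead of from a RUN step. The function name gpu_report is hypothetical, and it assumes nvidia-smi only appears on the PATH once the container is started on GPU hardware. It also guards the torch.cuda.get_device_name call with torch.cuda.is_available(), so the script reports a missing GPU instead of crashing with the RuntimeError from the traceback above.

```python
import shutil
import subprocess


def gpu_report() -> str:
    """Describe the GPU state without raising (hypothetical helper)."""
    lines = []

    # Assumption: nvidia-smi is only on PATH when the container runs on a GPU host.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        lines.append("nvidia-smi: not found")
    else:
        # "nvidia-smi -L" lists the GPUs the driver can see.
        result = subprocess.run([smi, "-L"], capture_output=True, text=True)
        lines.append("nvidia-smi: " + (result.stdout.strip() or result.stderr.strip()))

    try:
        import torch
    except ImportError:
        lines.append("torch: not installed")
        return "\n".join(lines)

    # Check availability first; current_device() raises when no driver is found.
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(torch.cuda.current_device())
        lines.append("CUDA device: " + name)
    else:
        lines.append("CUDA device: none visible to PyTorch")
    return "\n".join(lines)


if __name__ == "__main__":
    print(gpu_report())
```

If this prints a device name in the Space's container logs (not the build logs), the GPU is attached and the remaining problem is more likely in the installed packages, such as the CPU-only bitsandbytes build mentioned in the first warning.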