AutoTrain Advanced UI CUDA out of memory error

I am using the AutoTrain Advanced UI to train the Mixtral-8X7B-Instruct-v0.1 model. I have upgraded my Space's hardware to Nvidia 4xA10G Large, which has 184 GB RAM and 96 GB VRAM.

I think this hardware is powerful enough to train on my small dataset. Still, I am getting the following error:

ERROR  | 2024-01-08 13:15:32 | autotrain.trainers.common:wrapper:90 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/clm/__main__.py", line 186, in train
    model = AutoModelForCausalLM.from_pretrained(
  File "/app/env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/app/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3694, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/app/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4104, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/app/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 786, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(
  File "/app/env/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 98, in set_module_quantized_tensor_to_device
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(device)
  File "/app/env/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 179, in to
    return self.cuda(device)
  File "/app/env/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 157, in cuda
    w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics, quant_type=self.quant_type)
  File "/app/env/lib/python3.10/site-packages/bitsandbytes/functional.py", line 812, in quantize_4bit
    absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 19.06 MiB is free. Process 14595 has 21.96 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 174.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

❌ ERROR  | 2024-01-08 13:15:32 | autotrain.trainers.common:wrapper:91 - CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 19.06 MiB is free. Process 14595 has 21.96 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 174.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried various hardware options but still get this error. Please help me figure out what I am doing wrong.

Thanks in advance!!

You probably need 8xA100 for this model.
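
A rough sketch of the arithmetic behind that suggestion. The 96 GB on a 4xA10G is split across four separate 24 GB GPUs, and the error message above reports GPU 0 alone as full at 21.99 GiB, which suggests the 4-bit weights were being placed on a single device. The parameter count (about 46.7 billion in total, the figure published by Mistral AI) and the 80% headroom threshold are assumptions for illustration. The estimate ignores quantization constants, activations, and optimizer state, so real usage would be higher.

```python
# Back-of-the-envelope estimate of the memory needed for model weights alone.
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the size of the weights in GiB for a given precision."""
    return num_params * bits_per_param / 8 / GIB


# Assumption: ~46.7e9 total parameters for Mixtral-8x7B.
MIXTRAL_PARAMS = 46.7e9
# Taken from the error message: total capacity of GPU 0.
GPU0_CAPACITY_GIB = 21.99
# Assumption: leave ~20% of the GPU free for activations, adapters, and optimizer state.
HEADROOM_FRACTION = 0.8

weights_4bit = weight_memory_gib(MIXTRAL_PARAMS, 4)
weights_fp16 = weight_memory_gib(MIXTRAL_PARAMS, 16)
fits_on_one_gpu = weights_4bit < HEADROOM_FRACTION * GPU0_CAPACITY_GIB

print(f"4-bit weights: {weights_4bit:.1f} GiB")
print(f"fp16 weights:  {weights_fp16:.1f} GiB")
print(f"Fits on one A10G with headroom: {fits_on_one_gpu}")
```

Under these assumptions the 4-bit weights alone come to roughly the full capacity of one A10G, which is consistent with the allocation failing while the model is still loading.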

Thanks @abhishek for your quick response. I updated the following settings:

Now, I am getting this error:

INFO:     10.16.18.44:34499 - "POST /create_project HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/app/env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/src/autotrain/app.py", line 427, in handle_form
    job_id = project.create()
  File "/app/src/autotrain/project.py", line 71, in create
    return self.create_spaces()
  File "/app/src/autotrain/project.py", line 64, in create_spaces
    space_id = sr.prepare()
  File "/app/src/autotrain/backend.py", line 103, in prepare
    space_id = self._create_space()
  File "/app/src/autotrain/backend.py", line 247, in _create_space
    endpoint_id = self._create_endpoint()
  File "/app/src/autotrain/backend.py", line 210, in _create_endpoint
    return r.json()["name"]
KeyError: 'name'

This dataset works with 1xA100, but it throws the above error with 8xA100. Am I doing anything wrong?

Thanks.

You don't seem to have access to 8xA100. You need to email api-enterprise@hf.co to get access to them.
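
The traceback ends at `r.json()["name"]`, so the `KeyError` means the response to the create request did not contain a `name` field, which fits a request that was refused rather than fulfilled. The thread does not show the actual response body, so the payloads below are invented for illustration, and `extract_endpoint_name` is a hypothetical helper, not part of AutoTrain. It only sketches how a missing key could be turned into a more readable error.

```python
def extract_endpoint_name(payload: dict) -> str:
    """Return payload["name"], or raise a descriptive error if the key is absent."""
    if "name" not in payload:
        # Surface whatever the server sent back instead of a bare KeyError.
        raise RuntimeError(f"Endpoint creation failed; server response: {payload!r}")
    return payload["name"]


# Hypothetical payloads; the real response shape is not shown in the thread.
success_payload = {"name": "my-training-job"}
refused_payload = {"error": "hardware not available for this account"}

print(extract_endpoint_name(success_payload))

try:
    extract_endpoint_name(refused_payload)
except RuntimeError as exc:
    print(exc)
```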

For your original issue, I think I came across this error. I’m running a g5.xlarge (A10G). Are you running Linux? Check this link. Option 3: GRID is what helped me get through it.

You may need to install that after the Nvidia drivers are installed. Good luck!

Thanks @pcapazzi. I was using the AutoTrain Advanced UI. I reached out to HF support. They said that, as of now, I can’t have access to 8xA100 hardware, so I need to move to other resource providers like AWS. I will definitely try your suggestion and post my findings here so that others can benefit from them.

You can also just deploy the AutoTrain UI on your own local or AWS machine. Just export AUTOTRAIN_LOCAL=1 and your HF_TOKEN, install autotrain-advanced, and run ‘autotrain app’. It will run and train models locally.
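
The steps above can be sketched as a small Python launcher. It assumes `autotrain-advanced` is already installed (for example with `pip install autotrain-advanced`) so that the `autotrain` command is on the PATH. The helper names `build_local_launch` and `launch` are hypothetical, and the token value is a placeholder.

```python
import os
import subprocess


def build_local_launch(hf_token: str) -> tuple[list[str], dict[str, str]]:
    """Build the command and environment for running the AutoTrain UI locally."""
    env = dict(os.environ)          # copy, so the parent environment is untouched
    env["AUTOTRAIN_LOCAL"] = "1"    # run and train on this machine
    env["HF_TOKEN"] = hf_token      # Hugging Face access token
    return ["autotrain", "app"], env


def launch(hf_token: str) -> int:
    """Start the UI and block until it exits; returns the process exit code."""
    cmd, env = build_local_launch(hf_token)
    return subprocess.run(cmd, env=env).returncode
```

Calling `launch("hf_...")` with a real token would start the app in the foreground. The machine still needs enough GPU memory for the chosen model.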
