Multi-GPU Training with Trainer and a TokenClassification Model

Hello! I’m trying to fine-tune an “AutoModelForTokenClassification” model using the Trainer class, but I keep running into the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.75 GiB total capacity; 9.12 GiB already allocated; 10.69 MiB free; 9.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

However, both nvidia-smi and torch.cuda.device_count() show that I have 4 NVIDIA GeForce RTX 2080s available.
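For reference, this is roughly how I’m checking what PyTorch can see (a minimal sketch; the printed names are just whatever happens to be installed):

```python
import torch

# Count and name the CUDA devices visible to this process.
n = torch.cuda.device_count()
print(f"PyTorch sees {n} CUDA device(s)")
for i in range(n):
    print(torch.cuda.get_device_name(i))
```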

Things I have tried to remedy this, all of which have failed:
(1)
in a bash script:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 -m ensemble_method_testing

(2)
in my python file:

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"

(3)
in a bash script:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 -m torch.distributed.launch \
    --nproc_per_node 4 ensemble_method_testing.py

This one ran into a ChildFailedError.
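(Side note: I’ve read that torch.distributed.launch is deprecated in favor of torchrun, so the equivalent modern invocation would presumably be something like the following; I don’t know whether it changes the outcome here.)

```shell
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nproc_per_node 4 ensemble_method_testing.py
```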
(4)
In my python code:

model = torch.nn.DataParallel(model)
model.to(device)

This one ran into a KeyError, I think because my Dataset wasn’t being split across the GPUs. In any case, I want to keep using the Trainer class, so I don’t think this is the right approach.
(5)
in model.from_pretrained:
adding the “device_map” argument
This doesn’t work with the model I chose (“bert-base-german-cased”)
(6)
in my python code:
removing the calls to torch.cuda.device_count() and tf.test.gpu_device_name()
I don’t know whether this affects anything, but I tried it anyway.

(7)
a combination of (1) and (2)
