Hello, I have been trying to train a model (GPTNeoXJapanese 2.7b) on a custom dataset. I was able to get it training on Windows with Visual Studio (though very slowly), but when I switched to a VastAI Jupyter cloud instance I could not get it to run correctly. Specifically, the cloud GPUs continually run out of memory, even on a machine with 4x RTX 4090s, or, when I wrap the model with nn.DataParallel(model), I receive the error IndexError(f"Invalid key: {key} is out of bounds for size {size}").
To begin with, the program only seems to use one of the four GPUs and ignores the other three. On Windows, however, I was able to get the trainer to run with a single 3090. I have already tried about 5 different VastAI instances with different setups, and on each one I made sure to use CUDA with a matching PyTorch version, etc.
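As a rough sanity check on the memory side, the sketch below estimates the training footprint of a model of this size. It assumes full fine-tuning in fp32 with an Adam-style optimizer (4 bytes per parameter for weights, 4 for gradients, 8 for the two optimizer moment buffers), takes 2.7e9 parameters from the model name, and ignores activations entirely. The function name and the byte counts are illustrative assumptions, not measurements from the actual run.

```python
def estimate_training_memory_gib(num_params,
                                 bytes_weights=4,    # fp32 weights (assumption)
                                 bytes_grads=4,      # fp32 gradients (assumption)
                                 bytes_optimizer=8): # Adam: two fp32 moments (assumption)
    """Rough lower bound on training memory in GiB, excluding activations."""
    total_bytes = num_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 1024**3


def fits_on_gpu(num_params, gpu_memory_gib):
    """True if the estimated footprint fits within a single GPU's memory."""
    return estimate_training_memory_gib(num_params) <= gpu_memory_gib


if __name__ == "__main__":
    params = 2.7e9  # taken from the model name; the real count may differ
    print(f"Estimated footprint: {estimate_training_memory_gib(params):.1f} GiB")
    print("Fits on one 24 GiB card:", fits_on_gpu(params, 24))
```

Under these assumptions the estimate comes out at roughly 40 GiB, well above the 24 GiB of a single RTX 4090 or 3090. Note that nn.DataParallel places a full copy of the model on every GPU, so adding cards does not lower the memory needed on each one.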
I have also tried changing the batch size and gradient accumulation, uninstalling and reinstalling PyTorch, using fp16, and so forth, but nothing seems to work. Is there anything I’m missing that would cause my trainer to not work? If more information is needed, I will do my best to answer. Thank you very much!