Having Trouble Going from Windows to VastAI Cloud Training

GraysonB · August 31, 2023, 6:27pm

Hello, I have been trying to train a model (GPTNeoXJapanese 2.7b) on a custom dataset. I was able to get it training on Windows with Visual Studio (though very slowly), but when I switched to a VastAI Jupyter cloud instance I could not get it to run correctly. Specifically, the cloud GPUs keep running out of memory, even on a machine with 4x RTX 4090s. When I wrap the model in nn.DataParallel(model) instead, I get this error: raise IndexError(f"Invalid key: {key} is out of bounds for size {size}").

In the first place, the program doesn't seem to use the other 3 GPUs. Yet on Windows, I was able to get the trainer to run on a single 3090. I have already tried about 5 different VastAI instances with different setups, and on each one I made sure to use CUDA with a matching PyTorch version.

I have also tried changing batch sizes, adjusting gradient accumulation, uninstalling and reinstalling PyTorch, using fp16, and so forth, but nothing seems to work. Is there anything I'm missing that would cause my trainer to not work? If more information is needed, I will do my best to answer. Thank you very much!
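
For context on the out-of-memory symptom, here is a rough back-of-the-envelope sketch of how much GPU memory a 2.7b-parameter model might need during training. It assumes full fine-tuning in fp32 with an Adam-style optimizer (about 16 bytes per parameter: 4 for weights, 4 for gradients, 8 for the two optimizer moments) and ignores activations. None of these settings are stated in the post, and the function name is hypothetical. Note that nn.DataParallel places a full replica of the model on each GPU, so adding GPUs this way does not pool their memory.

```python
def estimate_training_memory_gb(num_params, bytes_per_param=16):
    """Rough lower bound (in GB) for weights, gradients, and optimizer state.

    bytes_per_param=16 is an assumption: fp32 weights (4) + fp32 gradients (4)
    + two fp32 Adam moments (8). Activations are not included.
    """
    return num_params * bytes_per_param / 1e9


# 2.7b parameters, as in the model named in the post
needed_gb = estimate_training_memory_gb(2.7e9)

# RTX 3090 and RTX 4090 cards each have 24 GB of memory
gpu_memory_gb = 24

fits_on_one_gpu = needed_gb <= gpu_memory_gb
print(f"Estimated {needed_gb:.1f} GB needed vs {gpu_memory_gb} GB per GPU")
print(f"Fits on one GPU under these assumptions: {fits_on_one_gpu}")
```

If the actual setup differs (for example a different optimizer, frozen layers, or 8-bit optimizer state), the per-parameter figure changes, so this is only an order-of-magnitude check.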
