Fine-tuning Mistral/Mixtral for sequence classification on long context


I would like to fine-tune Mistral, or if possible Mixtral, to classify long sequences, ideally with up to 32k tokens of context. These models already need a lot of memory to train, and if I understand correctly the memory needed grows quadratically with context length, which is why I run out of memory as I increase the context.
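To put rough numbers on the quadratic growth: with a naive attention implementation, each layer materializes a score matrix of shape (batch, heads, seq_len, seq_len). The sketch below only does that arithmetic. It assumes 32 attention heads (as in Mistral 7B) and 2-byte bf16 values, and the function name is made up for illustration. Fused kernels such as FlashAttention avoid storing this matrix, so with them activation memory grows roughly linearly with context instead.

```python
def attention_scores_bytes(seq_len, num_heads=32, batch_size=1, bytes_per_value=2):
    """Bytes needed to hold one layer's full attention score matrix."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


GIB = 1024 ** 3

# Print the per-layer cost for a few context lengths.
for seq_len in (4096, 8192, 16384, 32768):
    gib = attention_scores_bytes(seq_len) / GIB
    print(f"{seq_len:>6} tokens: {gib:5.0f} GiB per layer")
```

Under these assumptions, going from 4k to 32k tokens multiplies this term by 64 (from 1 GiB to 64 GiB per layer), which is why a single 80GB card runs out quickly unless the attention implementation avoids materializing the matrix.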

For this reason I tried running it on an A100 with quantization and LoRA. That let me run the code, but as I increase the context I still get out-of-memory errors.
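For reference, a setup of this kind might look roughly like the sketch below. It is not the exact code from this post: the checkpoint name, LoRA rank, target modules and the helper name `load_quantized_classifier` are all assumptions chosen for illustration, and it needs `transformers`, `peft`, `bitsandbytes` and a CUDA GPU to actually run.

```python
MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
LORA_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]  # assumed choice


def load_quantized_classifier(num_labels, model_name=MODEL_NAME):
    """Load a 4-bit base model with LoRA adapters for sequence classification."""
    # Heavy imports live inside the function so that importing this file
    # does not itself require the GPU stack.
    import torch
    from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The Mistral tokenizer ships without a pad token; reuse EOS for padding.
    tokenizer.pad_token = tokenizer.eos_token

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        quantization_config=bnb_config,
    )
    model.config.pad_token_id = tokenizer.pad_token_id

    # Prepares the quantized model for training and turns on gradient
    # checkpointing, which trades extra compute for lower activation memory.
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=LORA_TARGET_MODULES,
    )
    return tokenizer, get_peft_model(model, lora_config)
```

Quantization and LoRA mainly shrink the weights, gradients and optimizer state. The activations, which are what grow with context length, are largely unaffected, so it is plausible that this setup still runs out of memory at long contexts. If the `flash-attn` package is installed, passing `attn_implementation="flash_attention_2"` to `from_pretrained` may help on that side.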

I started looking at ZeRO with DeepSpeed and Accelerate, as well as model and pipeline parallelism, and at how to set these up on multiple A100s. But since I am quite new to this, I am not sure how to implement it or whether it will resolve my problem.
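As a starting point, a ZeRO stage 3 configuration could be generated along these lines. This is a sketch under assumptions: the helper name `build_zero3_config` is invented, the specific options (CPU offload, bf16) are just one plausible choice, and the `"auto"` values rely on the Hugging Face Trainer or Accelerate integration filling them in.

```python
import json


def build_zero3_config(offload_to_cpu=True):
    """Return a DeepSpeed config dict for ZeRO stage 3 with bf16."""
    zero = {
        "stage": 3,  # shard parameters, gradients and optimizer state
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    }
    if offload_to_cpu:
        # Optionally move optimizer state and parameters to host memory.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "bf16": {"enabled": True},
        "zero_optimization": zero,
        # "auto" lets the Hugging Face integration fill these in.
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


print(json.dumps(build_zero3_config(), indent=2))
```

The printed JSON can be saved to a file and passed to the Trainer via `TrainingArguments(deepspeed=...)`, or selected when running `accelerate config`. One caveat: ZeRO shards parameters, gradients and optimizer state across GPUs, but it does not split activations along the sequence dimension, so on its own it may not remove the per-GPU memory pressure that comes from long contexts. Whether it combines cleanly with 4-bit quantization also depends on the library versions in use.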

I would be grateful for any advice on whether I am heading in the right direction, how I should approach this, or whether there is a good example implementation.

Thank you!

Hey SkazaAl,
Did you find a way to do it? I am stuck at the same point. I eventually tried 4x L4 GPUs but could not get parallel training to work. If you find a solution, please share it here. It would be very helpful.



No, I didn't find a solution that would be efficient enough to run on a limited number of 80GB GPUs. Mistral was trained with a 4000-token context and, according to information I found from the CEO, needed 500 GPUs. From that I concluded that methods for lowering GPU memory consumption won't compensate for the increased requirements of a longer context.

I got training to work on smaller contexts by following the Accelerate documentation, but I don't know whether my implementation distributes the training optimally.
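For anyone trying the same, the general shape of an Accelerate loop is sketched below. It is not my exact code: the function name `train` and the default values are assumptions, the batches are assumed to be dicts that include `labels`, and it needs the `accelerate` package plus a launch via `accelerate launch`.

```python
def train(model, optimizer, dataloader, num_epochs=1, grad_accum_steps=8):
    """Train with Accelerate handling device placement and gradient sync."""
    # Imported lazily so that importing this file does not require accelerate.
    from accelerate import Accelerator

    accelerator = Accelerator(
        gradient_accumulation_steps=grad_accum_steps,
        mixed_precision="bf16",
    )
    # Report how the run is actually distributed.
    accelerator.print(f"processes: {accelerator.num_processes}")
    accelerator.print(f"distributed type: {accelerator.distributed_type}")

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            # Gradients are synchronized only on accumulation boundaries.
            with accelerator.accumulate(model):
                loss = model(**batch).loss
                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()

    return accelerator.unwrap_model(model)
```

The two printed lines are a quick check of what the launcher actually set up. Without a DeepSpeed or FSDP configuration, Accelerate falls back to plain data parallelism, where every GPU holds a full copy of the model, so adding GPUs increases throughput but not the context length that fits on each card.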