Fine-tuning Mistral/Mixtral for sequence classification on long context

I would like to fine-tune Mistral (or, if possible, Mixtral) for classification of long sequences, ideally up to the full 32k context. These models already need a lot of memory to train on their own, and if I understand correctly the attention memory grows quadratically with sequence length, which is why I run out of memory as I try to increase the context.
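To make the quadratic growth concrete, here is my rough back-of-envelope estimate for the full attention-score matrix of a Mistral-7B-sized model (32 heads, fp16). These numbers are my own guess and ignore memory-efficient kernels like FlashAttention or Mistral's sliding-window attention, which avoid materializing this matrix:

```python
# Rough estimate of one layer's full seq_len x seq_len attention-score
# matrix, assuming 32 heads and fp16 (2 bytes per element), with no
# memory-efficient attention. Illustrative numbers only.

def attn_matrix_bytes(seq_len, num_heads=32, bytes_per_el=2):
    """Bytes needed to materialize one layer's attention scores."""
    return num_heads * seq_len * seq_len * bytes_per_el

for ctx in (2_048, 8_192, 32_768):
    gib = attn_matrix_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:8.2f} GiB per layer")
```

At 32k tokens that is 64 GiB per layer for the scores alone, which matches the out-of-memory behaviour I am seeing.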

For this reason I tried running it on an A100, as well as using quantization and LoRA, which enabled me to run the code, but as I increase the context I still get out-of-memory errors.

I started looking at the ZeRO implementation in DeepSpeed and Accelerate, as well as model and pipeline parallelism, and how to run these across multiple A100s. But being quite new to this, I am not really sure how to implement it or whether it will actually resolve my problem.
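For concreteness, this is the kind of DeepSpeed ZeRO-3 config (passed to `accelerate launch` via a `deepspeed_config_file`) I have been looking at. The values are illustrative guesses on my part, not a tested setup:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu" },
    "offload_optimizer": { "device": "cpu" },
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": 1
}
```

My uncertainty is whether sharding parameters and optimizer states like this actually helps with the activation memory that grows with context length, or only with the model-weight side.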

I would be grateful for any advice on whether I am going in the right direction, how I should approach this, or any good example implementations.

Thank you!