I received my brand new M3 Max and sadly discovered that bitsandbytes is not supported, so I had to adapt my training code to fine-tune Mistral on my dataset:
=> Changed the device to the proper one (`mps` on Apple Silicon)
=> Removed the bnb config
=> Removed the load-in-4-bit / 8-bit flags
=> Changed the optim to adamw_torch (my previous one was a paged 32-bit optimizer and so depended on bitsandbytes)
=> Changed the batch size to 10… because I do what I want, it’s my life
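For reference, the changes above boil down to something like this sketch. Only the optimizer name and the batch size come from my list; the device-picking helper and everything else are illustrative:

```python
import torch

# Pick the device explicitly instead of assuming CUDA
# ("mps" is PyTorch's backend for Apple Silicon GPUs).
def pick_device() -> str:
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

# bitsandbytes-free trainer settings, to be passed to TrainingArguments.
training_kwargs = dict(
    optim="adamw_torch",             # paged 32-bit AdamW requires bitsandbytes
    per_device_train_batch_size=10,
    # and on the model side: no bnb quantization config,
    # no load_in_4bit / load_in_8bit
)

print(pick_device())
```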
Training started with an estimated time of 50 min (vs. 6 hrs on my Titan X).
The loss was decreasing as usual, below 1, then suddenly jumped up to 4, then dropped to 0…
The same code runs fine on my Titan X.
Am I the only one experiencing this kind of behaviour on Apple Silicon?
Is anyone using transformers (SFTTrainer) to fine-tune on a Mac with an M-series chip?