Fine-tuning VideoMAE in multiple stages, adding n classes each time

Hello everyone, I have a video dataset of sign language with multiple subfolders: numbers, alphabets, nouns, verbs and adjectives. I have fine-tuned the VideoMAE model on the numbers videos only, which gave me good accuracy. My task now is to fine-tune VideoMAE on the whole dataset, which is very large relative to the computing resources I have. How should I proceed to fine-tune the model on the whole dataset?
My idea is to fine-tune the model on one group of classes at a time, and to have only one model at the end that works as a classifier for my whole dataset.
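
To make the idea concrete, here is a minimal sketch of what I mean by adding n classes at each stage. It assumes PyTorch and that the classification head is a single `nn.Linear` layer (as far as I understand, that is what `model.classifier` is in the Hugging Face `VideoMAEForVideoClassification`). The helper name `expand_classifier`, the feature size 768 and the class counts are only illustrative assumptions, not something I have validated on my data.

```python
import torch
from torch import nn


def expand_classifier(old_head: nn.Linear, n_new: int) -> nn.Linear:
    """Hypothetical helper: return a head with n_new extra output classes,
    copying the already-trained weights for the old classes."""
    n_old = old_head.out_features
    new_head = nn.Linear(
        old_head.in_features, n_old + n_new, bias=old_head.bias is not None
    )
    with torch.no_grad():
        # Rows 0..n_old-1 keep the weights learned in the previous stage;
        # the remaining n_new rows keep their fresh random initialisation.
        new_head.weight[:n_old].copy_(old_head.weight)
        if old_head.bias is not None:
            new_head.bias[:n_old].copy_(old_head.bias)
    return new_head


# Stand-in for a head already fine-tuned on 10 number classes
# (768 is an assumed feature size, used only for illustration).
head = nn.Linear(768, 10)
bigger = expand_classifier(head, n_new=26)  # e.g. add 26 alphabet classes
print(bigger.out_features)  # 36
```

With the real model I would presumably do something like `model.classifier = expand_classifier(model.classifier, n_new)` before each new stage and update the label count in the config accordingly, but I am not sure this is the right way to go.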

Any help would be appreciated.