What's the best stack to use on Hugging Face to train a model that gives me long video output from text, in a specific art style?

I want to train a model on a dataset of about 10,000 images in a specific style I have created, consisting of animation sprite sheets and all the assets needed to build scenes. I also have a lot of video clips of the desired output in the same art style.

The input will be text describing an animated scene, for example: make {this character (already tagged)} walk to the north-east, open a door, change scene, and continue on to talk to a zebra witch wearing a pink tutu.

The zebra wouldn’t have been tagged or classified on my end yet, but I would still want the model to output everything, and I want to train it as quickly as possible.

I want to be able to train it in the same chat window if possible; I don’t want to have to re-train or re-compile after each new image is added to the dataset.

The ideal scenario would be to just keep feeding the model training images if its output is too far off, or to give it more examples of different character sprite sheets if we want to, say, have the zebra do a backflip.

I'm still very new to this. I have two little babies and very little time to explore this mind-blowing new tech; I only discovered Stable Diffusion and Hugging Face today. Thank you for such an amazing product, team, and community.

I just typed this into the playground and I'm going to start unpacking it:

The best stack to use on Hugging Face to train a model to give you a specific long video output from text and a specific art style is a combination of a transformer-based natural language processing (NLP) model, such as BERT or RoBERTa, and a generative model such as a GAN. The transformer-based NLP model would convert the input text into a sequence of features, which would then be used as input to the generative model to generate the video output. The GAN could be configured to generate frames in the specific art style you have created.
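
To help me unpack that answer, here is a minimal sketch of what that stack could look like in code: a Hugging Face text encoder feeding a toy conditional generator that produces a single frame. The model name (roberta-base), frame size, and network shapes are placeholder assumptions on my part, not a working recipe for the full text-to-video system.

```python
# Minimal sketch: transformer text encoder + toy GAN-style generator for one frame.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# 1. Text side: encode the scene description into a fixed-size feature vector.
encoder_name = "roberta-base"  # assumption: any transformer encoder could stand in here
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
text_encoder = AutoModel.from_pretrained(encoder_name)

prompt = "The character walks north-east, opens a door, and talks to a zebra witch."
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # Mean-pool token embeddings into one vector of shape (1, 768).
    text_features = text_encoder(**tokens).last_hidden_state.mean(dim=1)

# 2. Image side: a toy generator mapping noise + text features to one RGB frame.
class FrameGenerator(nn.Module):
    def __init__(self, text_dim=768, noise_dim=100, frame_size=64):
        super().__init__()
        self.frame_size = frame_size
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 3 * frame_size * frame_size),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, text_features, noise):
        x = torch.cat([text_features, noise], dim=-1)
        frame = self.net(x)
        return frame.view(-1, 3, self.frame_size, self.frame_size)

generator = FrameGenerator()
noise = torch.randn(1, 100)
frame = generator(text_features, noise)  # one untrained 64x64 RGB frame
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```

As I understand it, the real system would train this generator against a discriminator on my sprite-sheet dataset and generate a sequence of frames rather than one, but the sketch at least shows how the text features would condition the image side.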