Generate your own TV shows

:wave: Please read the topic category description to understand what this is all about

Description

Fine-tune an autoregressive Transformer model on the transcripts of your favourite TV show to generate new episodes!

Model(s)

A decoder-based Transformer model would be well suited for this task. You can find these models under the Text Generation filter on the Hub, and the following would be good starting points:

  • GPT-2
  • GPT-Neo
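
For instance, either checkpoint can be loaded in a couple of lines with the transformers library (a minimal sketch; the model IDs are the standard Hub names, and the GPT-Neo size shown is just one of several available):

```python
# Minimal loading sketch; swap in another Hub model ID to compare.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # or "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```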

Datasets

You can usually find the transcripts for TV shows with a quick Google search. For example:

  • here are the transcripts for the popular Rick and Morty series.
  • here are the transcripts for The Simpsons series.

Challenges

The transcripts are unlikely to come in a ready-to-use format for language modeling, so some data wrangling will be needed.

Desired project outcomes

  • Create a Streamlit or Gradio app on :hugs: Spaces that allows people to generate their own TV scripts from an input prompt.
  • Don't forget to push all your models and datasets to the Hub so others can build on them!

Additional resources

  • Check out @merve's cool Space on French story generation to get an idea about creating a Space for this project.

Hi everyone!

This seemed like a really fun project to get my hands dirty with Hugging Face and training GPT-style models. Plus, as a Seinfeld fan I would love reading the output from this prompt. So, the idea is to give an episode name as the prompt, and then get (an episode with) fun Seinfeld dialogue (ambitious, I know). I have a little ML experience from several years ago, but I'm mostly a developer and researcher. Here are the steps I took.

Before embarking, I went over the Hugging Face intro course. That all seemed clear to me and I felt ready to give my own project a try; the best learning in this field is by doing.

First, I collected the Seinfeld scripts. I wrote a simple scraper that saved all the scripts as .txt files. I got them from here: Seinfeld Transcripts
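
The scraper was nothing fancy; a minimal sketch of that kind of script, assuming an index page linking out to per-episode transcript pages (the URL and CSS selector below are placeholders, not the real site's structure):

```python
# Minimal scraping sketch: fetch an index page, follow each episode
# link, and dump the page text to a .txt file per episode.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.com/seinfeld/scripts"  # placeholder URL
OUT_DIR = "scripts"
os.makedirs(OUT_DIR, exist_ok=True)

index = BeautifulSoup(requests.get(INDEX_URL).text, "html.parser")
for i, link in enumerate(index.select("a.episode")):  # hypothetical selector
    page_url = urljoin(INDEX_URL, link["href"])
    page = BeautifulSoup(requests.get(page_url).text, "html.parser")
    with open(os.path.join(OUT_DIR, f"episode_{i:03d}.txt"), "w") as f:
        f.write(page.get_text(separator="\n"))
```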

Next, I wanted to train the suggested model, GPT-2, on the Seinfeld script data. This is where things got a little confusing, because I was unsure how to preprocess the data and in what format. For the prompt, I was fine with the bot simply generating some kind of story loosely related to a title; as long as it was somewhat coherent, I felt I could fine-tune this later. In the article Fine-tune a non-English GPT-2 Model with Huggingface, I found an example of how to train a model on free-form text data. I created a Python list of dicts, each containing a "Title" and a "Script" key; the script key held the raw script text. I cleaned the text of blank lines and stray spaces.
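
In code, the preprocessing amounted to roughly this (the directory name matches the scraper sketch above; the cleanup rules are simplified):

```python
# Preprocessing sketch: one dict per episode, with the raw script
# cleaned of blank lines and runs of spaces.
import os
import re

episodes = []
for fname in sorted(os.listdir("scripts")):
    with open(os.path.join("scripts", fname)) as f:
        raw = f.read()
    script = re.sub(r"\n\s*\n+", "\n", raw)          # drop blank lines
    script = re.sub(r"[ \t]+", " ", script).strip()  # collapse spaces
    episodes.append({"Title": os.path.splitext(fname)[0], "Script": script})
```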

I split the Python list of episode data into train and test sets (20% test). For each set I then simply concatenated the script texts and saved them to a single file. I then used the GPT-2 tokenizer (AutoTokenizer.from_pretrained("gpt2")) and created my custom dataset. (I still need to push the dataset, in the correct format, to the Hub.)
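
Roughly like this; using sklearn for the 80/20 split is my own choice, TextDataset is the legacy helper the guide relies on, and block_size=128 is a guess rather than a tuned value:

```python
# Split/concatenate sketch, continuing from the episodes list above.
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TextDataset

train_eps, test_eps = train_test_split(episodes, test_size=0.2, random_state=42)

# One big text file per split, episodes separated by blank lines.
for split_name, split in [("train", train_eps), ("test", test_eps)]:
    with open(f"{split_name}.txt", "w") as f:
        f.write("\n\n".join(ep["Script"] for ep in split))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
test_dataset = TextDataset(tokenizer=tokenizer, file_path="test.txt", block_size=128)
```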

Then I tried to train the model following instructions similar to the guide's. I wasn't quite sure what all the training args were supposed to do, so I stuck to the defaults. Also, I had never worked with Google Colab before, so I had to figure that out (it wasn't too hard), and I realized that training should be done on a GPU... (lesson learned).
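
The training loop, continuing from the dataset sketch above, looked approximately like this (the argument values shown are the library defaults I mentioned, not tuned):

```python
# Training sketch with mostly default arguments.
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("gpt2")
# mlm=False -> plain causal language modeling, no masked tokens.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./seinfeld-gpt2",   # where checkpoints land
    num_train_epochs=3,             # default; a knob worth tweaking
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()
trainer.save_model("./seinfeld-gpt2")
```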

With the trained model ready, I could load a pipeline for 'text-generation'. I passed it the GPT-2 tokenizer and set max_length=1000. I wrapped the model in a Gradio app and published it to Spaces.
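
The inference side is short; a sketch of the pipeline plus the Gradio wrapper (the model path is illustrative):

```python
# Inference sketch: text-generation pipeline on the fine-tuned model,
# wrapped in a minimal Gradio interface.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation",
                     model="./seinfeld-gpt2",
                     tokenizer="gpt2")

def generate(title):
    # The episode title is the whole prompt, as described above.
    return generator(title, max_length=1000)[0]["generated_text"]

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```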

Time to play! Unfortunately, the results are a bit so-so. At first glance the output looks like Seinfeld dialogue, but it is really gibberish and incoherent. Surely it can do better than this. I would like to build on this model, and I would appreciate other people's feedback on how to make it better.

  • I'm not quite sure how to effectively feed the training data. My initial approach was to simply give it a single text file with one long list of script strings. But perhaps I could train the model on dialogue per character? Are there other approaches I could try when training?
  • Are there alternative ways to engineer the prompt, maybe with more than just an episode name?
  • I would like the model to generate more sentences than it currently does. I have to figure out (i.e. read up on) how to do so; the generation-parameter sketch after this list is where I've started looking. Any directions?
  • I would like to try different models. I used GPT-2 for now, but perhaps GPT-Neo gives better results. Would people recommend other models?
  • How could I improve the processing time for a prompt? Is that only a matter of better (premium, paid) resources, or could I do something else?
  • I'll dive a bit more into figuring out what the training args do. Maybe a few tweaks there can make a big difference, for example training for more epochs.
  • Are there other things I could consider?
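
On the output-length point, these are the generation parameters I plan to experiment with, continuing from the pipeline sketch above (the values are guesses, not tuned):

```python
# Generation-parameter sketch: max_new_tokens controls output length
# directly; the sampling settings trade coherence against repetition.
output = generator(
    "The Parking Spot",       # hypothetical episode-title prompt
    max_new_tokens=500,       # number of newly generated tokens
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    repetition_penalty=1.2,
)[0]["generated_text"]
print(output)
```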

I'm happy to keep tinkering myself, but any nods in the right direction would be much appreciated :)!

You can noodle with it here: Seinfeld Dialogue - a Hugging Face Space by Adam173. Do note that it takes about 70s to process a given prompt.