Hi everyone!
This seemed like a really fun project to get my hands dirty with Hugging Face and training GPT-style models. Plus, as a Seinfeld fan, I would love to read the output from this prompt. So, the idea is to give an episode name as the prompt, and then get (an episode with) fun Seinfeld dialogue (ambitious, I know). I have a little ML experience from several years ago, but I'm mostly a developer and researcher. Here are the steps that I took.
Before embarking, I went over the Hugging Face intro course. It all seemed clear to me, and I felt ready to give my own project a try. In this field, the best learning is by doing.
First, I collected the Seinfeld scripts. I wrote a simple scraper that saved all the script files as .txt files. I got them from here: Seinfeld Transcripts
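For reference, the scraper was along these lines. This is a minimal sketch using only the standard library; the URL and file names are placeholders, not the real site structure, and a real scraper would also skip `<script>`/`<style>` blocks and navigation text:

```python
# Hypothetical sketch of the scraping step: fetch an episode page and save
# its visible text as a .txt file. URLs and paths here are placeholders.
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

def save_episode(url: str, out_path: str) -> None:
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(extract_text(html))
```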
Next, I wanted to train the suggested model, GPT-2, on the Seinfeld script data. This is where things got a little confusing, because I was unsure how to preprocess the data and in what format. For the prompt, I was fine if the bot simply generated some kind of story loosely related to a title; as long as it was somewhat coherent, I felt I could fine-tune this later. In the article Fine-tune a non-English GPT-2 Model with Huggingface, I found an example of how to train a model on free-text data. I created a Python list of dicts, each containing a "Title" and a "Script" key. The "Script" key held the raw script text, which I cleaned of blank lines and stray spaces.
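The preprocessing was roughly the following (a sketch; deriving the episode title from the file name is an assumption, adjust to however your scraper named things):

```python
# Build the list of {"Title": ..., "Script": ...} dicts from the saved .txt
# files, collapsing blank lines and stripping stray spaces along the way.
import re
from pathlib import Path

def clean_script(raw: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines."""
    lines = [line.strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{2,}", "\n", text).strip()

def load_episodes(script_dir: str) -> list:
    episodes = []
    for path in sorted(Path(script_dir).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        episodes.append({
            # e.g. "the-contest.txt" -> "The Contest" (naming is an assumption)
            "Title": path.stem.replace("-", " ").title(),
            "Script": clean_script(raw),
        })
    return episodes
```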
I split the Python list of episode data into train and test sets (20% test). For each set, I then simply concatenated the script texts and saved them to a single file. I then used the GPT-2 tokenizer (AutoTokenizer.from_pretrained("gpt2")) and created my custom dataset. (I still need to push the dataset in the correct format to the Hub.)
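The split-and-concatenate step looked roughly like this (a sketch; paths and the seed are illustrative, and prepending each title to its script is how I tied titles to dialogue). The resulting files are what the GPT-2 tokenizer is then run over:

```python
# Shuffle the episode dicts, hold out 20% for testing, and write each subset
# to a single concatenated text file with the title prepended to each script.
import random

def split_and_save(episodes, train_path="train.txt", test_path="test.txt",
                   test_frac=0.2, seed=42):
    episodes = list(episodes)  # avoid mutating the caller's list
    random.Random(seed).shuffle(episodes)
    n_test = int(len(episodes) * test_frac)
    test, train = episodes[:n_test], episodes[n_test:]
    for subset, path in ((train, train_path), (test, test_path)):
        with open(path, "w", encoding="utf-8") as f:
            for ep in subset:
                f.write(ep["Title"] + "\n" + ep["Script"] + "\n\n")
    return train, test
```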
Then I tried to train the model following instructions similar to those in the guide. I wasn't quite sure what all the training args were supposed to do, so I stuck to the defaults. Also, I had never worked with Google Colab before, so I had to figure that out (it wasn't too hard) and realized that training should be done on a GPU... (lesson learned).
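For context, the training step with default hyperparameters boils down to something like this sketch (names and paths are illustrative; TextDataset is the simple, now-deprecated way to chunk a single text file for language modeling):

```python
# Rough sketch of fine-tuning GPT-2 with the Hugging Face Trainer, keeping
# the default training arguments as described above.
def finetune(train_file="train.txt", eval_file="test.txt",
             output_dir="seinfeld-gpt2", block_size=128):
    from transformers import (
        AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
        TextDataset, Trainer, TrainingArguments,
    )
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    train_dataset = TextDataset(tokenizer=tokenizer, file_path=train_file,
                                block_size=block_size)
    eval_dataset = TextDataset(tokenizer=tokenizer, file_path=eval_file,
                               block_size=block_size)
    # mlm=False -> causal language modeling, which is what GPT-2 does
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,             # the default
        per_device_train_batch_size=8,  # the default
    )
    trainer = Trainer(model=model, args=args, data_collator=collator,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    trainer.save_model(output_dir)
```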
With the trained model ready, I could load a pipeline for "text-generation". I passed it the GPT-2 tokenizer and set max_length=1000. I wrapped the model in a Gradio app and published it to Spaces.
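The inference side is roughly the following sketch (the model directory, labels, and app title are illustrative; max_length=1000 mirrors what I set above):

```python
# Sketch of the inference app: a text-generation pipeline wrapped in Gradio.
def build_generate(model_dir="seinfeld-gpt2", max_length=1000):
    from transformers import AutoTokenizer, pipeline
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    generator = pipeline("text-generation", model=model_dir, tokenizer=tokenizer)

    def generate(title: str) -> str:
        # The episode title is the whole prompt; return the generated text
        return generator(title, max_length=max_length)[0]["generated_text"]
    return generate

def launch_app(generate):
    import gradio as gr
    gr.Interface(fn=generate,
                 inputs=gr.Textbox(label="Episode title"),
                 outputs=gr.Textbox(label="Generated script"),
                 title="Seinfeld episode generator").launch()

# launch_app(build_generate())
```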
Time to play! Unfortunately, the results are a bit so-so. At first glance it looks like Seinfeld dialogue, but it really is gibberish and incoherent. Surely it can do better than it's doing now. I would like to build on this model, and I would appreciate other people's feedback on how to make it better.
- I'm not quite sure how to effectively feed the training data. My initial approach was to simply give it a single text file with one long list of script strings. But perhaps I could train the model on dialogue per character? Are there other approaches I could try when training?
- Are there alternative ways to engineer the prompt, maybe not just an episode name?
- I would like the model to generate more sentences than it currently does. I have to figure out (i.e., read up on) how to do so. Any directions?
- I would like to try different models. I used the GPT-2 model for now, but perhaps GPT-Neo gives better results. Would people recommend other models?
- How could I improve the processing time of the prompt? Is that just a matter of better (premium, paid) resources, or is there something else I could do?
- I'll dive a bit more into figuring out what the training args do. Maybe a few tweaks there, such as training for more epochs, can make a big difference.
- Are there other things I could consider?
I'm happy to keep tinkering myself, but any nods in the right direction would be much appreciated :)!