Hi everyone!
This seemed like a really fun project to get my hands dirty with Hugging Face and training GPT-style models. Plus, as a Seinfeld fan, I would love to read the output from this prompt. So, the idea is to give an episode name as the prompt, and then get (an episode with) fun Seinfeld dialogue (ambitious, I know). I have a little ML experience from several years ago, but I'm mostly a developer and researcher. Here are the steps that I took.
Before embarking, I went over the Hugging Face intro course. It all seemed clear to me, and I felt ready to give my own project a try. In this field, the best learning is by doing.
First, I collected the Seinfeld scripts. I wrote a simple scraper that saved all the script files as .txt files. I got them from here: Seinfeld Transcripts
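For reference, the scraper was along these lines. This is a minimal sketch using only the standard library; the URL and file names are placeholders, not the real site structure, and a real scraper would also skip `<script>`/`<style>` blocks and navigation text:

```python
# Hypothetical sketch of the scraping step: fetch an episode page and save
# its visible text as a .txt file. URLs and paths here are placeholders.
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

def save_episode(url: str, out_path: str) -> None:
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(extract_text(html))
```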
Next, I wanted to train the suggested model, GPT-2, on the Seinfeld script data. This is where things got a little confusing, because I was unsure how to preprocess the data and in what format. For the prompt, I was fine if the bot simply generated some kind of story loosely related to a title; as long as it was somewhat coherent, I felt I could fine-tune this later. In the article Fine-tune a non-English GPT-2 Model with Huggingface, I found an example of how to train a model on free-text data. I created a Python list of dicts, each containing a "Title" and a "Script" key. The "Script" key held the raw script text, which I cleaned of blank lines and stray spaces.
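The preprocessing was roughly the following (a sketch; deriving the episode title from the file name is an assumption, adjust to however your scraper named things):

```python
# Build the list of {"Title": ..., "Script": ...} dicts from the saved .txt
# files, collapsing blank lines and stripping stray spaces along the way.
import re
from pathlib import Path

def clean_script(raw: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines."""
    lines = [line.strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{2,}", "\n", text).strip()

def load_episodes(script_dir: str) -> list:
    episodes = []
    for path in sorted(Path(script_dir).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        episodes.append({
            # e.g. "the-contest.txt" -> "The Contest" (naming is an assumption)
            "Title": path.stem.replace("-", " ").title(),
            "Script": clean_script(raw),
        })
    return episodes
```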
I split the Python list of episode data into train and test sets (20% test). For each set, I then simply concatenated the script texts and saved them to a single file. I then used the GPT-2 tokenizer (AutoTokenizer.from_pretrained("gpt2")) and created my custom dataset. (I still need to push the dataset in the correct format to the Hub.)
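The split-and-concatenate step looked roughly like this (a sketch; paths and the seed are illustrative, and prepending each title to its script is how I tied titles to dialogue). The resulting files are what the GPT-2 tokenizer is then run over:

```python
# Shuffle the episode dicts, hold out 20% for testing, and write each subset
# to a single concatenated text file with the title prepended to each script.
import random

def split_and_save(episodes, train_path="train.txt", test_path="test.txt",
                   test_frac=0.2, seed=42):
    episodes = list(episodes)  # avoid mutating the caller's list
    random.Random(seed).shuffle(episodes)
    n_test = int(len(episodes) * test_frac)
    test, train = episodes[:n_test], episodes[n_test:]
    for subset, path in ((train, train_path), (test, test_path)):
        with open(path, "w", encoding="utf-8") as f:
            for ep in subset:
                f.write(ep["Title"] + "\n" + ep["Script"] + "\n\n")
    return train, test
```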
Then I tried to train the model following instructions similar to those in the guide. I wasn't quite sure what all the training args were supposed to do, so I stuck to the defaults. Also, I had never worked with Google Colab before, so I had to figure that out (it wasn't too hard) and realized that training should be done on a GPU... (lesson learned).
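For context, the training step with default hyperparameters boils down to something like this sketch (names and paths are illustrative; TextDataset is the simple, now-deprecated way to chunk a single text file for language modeling):

```python
# Rough sketch of fine-tuning GPT-2 with the Hugging Face Trainer, keeping
# the default training arguments as described above.
def finetune(train_file="train.txt", eval_file="test.txt",
             output_dir="seinfeld-gpt2", block_size=128):
    from transformers import (
        AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
        TextDataset, Trainer, TrainingArguments,
    )
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    train_dataset = TextDataset(tokenizer=tokenizer, file_path=train_file,
                                block_size=block_size)
    eval_dataset = TextDataset(tokenizer=tokenizer, file_path=eval_file,
                               block_size=block_size)
    # mlm=False -> causal language modeling, which is what GPT-2 does
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,             # the default
        per_device_train_batch_size=8,  # the default
    )
    trainer = Trainer(model=model, args=args, data_collator=collator,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    trainer.save_model(output_dir)
```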
With the trained model ready, I could load a pipeline for "text-generation". I passed it the GPT-2 tokenizer and set max_length=1000. I wrapped the model in a Gradio app and published it to Spaces.
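The inference side is roughly the following sketch (the model directory, labels, and app title are illustrative; max_length=1000 mirrors what I set above):

```python
# Sketch of the inference app: a text-generation pipeline wrapped in Gradio.
def build_generate(model_dir="seinfeld-gpt2", max_length=1000):
    from transformers import AutoTokenizer, pipeline
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    generator = pipeline("text-generation", model=model_dir, tokenizer=tokenizer)

    def generate(title: str) -> str:
        # The episode title is the whole prompt; return the generated text
        return generator(title, max_length=max_length)[0]["generated_text"]
    return generate

def launch_app(generate):
    import gradio as gr
    gr.Interface(fn=generate,
                 inputs=gr.Textbox(label="Episode title"),
                 outputs=gr.Textbox(label="Generated script"),
                 title="Seinfeld episode generator").launch()

# launch_app(build_generate())
```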
Time to play! Unfortunately, the results are a bit so-so. At first glance it looks like Seinfeld dialogue, but it really is gibberish and incoherent. Surely it can do better than it's doing now. I would like to build on this model, and I would appreciate other people's feedback on how to make it better.
- I'm not quite sure how to effectively feed the training data. My initial approach was to simply give it a single text file with one long list of script strings. But perhaps I could train the model on dialogue per character? Are there other approaches I could try when training?
- Are there alternative ways to engineer the prompt, maybe not just an episode name?
- I would like the model to generate more sentences than it currently does. I have to figure out (i.e., read up on) how to do so. Any directions?
- I would like to try different models. I used the GPT-2 model for now, but perhaps GPT-Neo gives better results. Would people recommend other models?
- How could I improve the processing time of the prompt? Is that just a matter of better (premium, paid) resources, or is there something else I could do?
- I'll dive a bit more into figuring out what the training args do. Maybe a few tweaks there, such as training for more epochs, can make a big difference.
- Are there other things I could consider?
I'm happy to keep tinkering myself, but any nods in the right direction would be much appreciated :)!