Training DistilGPT2

Hello!
I am trying to find resources/code samples for retraining the DistilGPT2 model on text I have preprocessed myself, but could not find any. Most of the documentation relates to DistilBERT and its uses.
I have also trained a gpt2-simple (TensorFlow-based) model. If there is a way to distill that model as well, it would help me too!
Thanks for your help.


Hi @abhilashpal, you can find the distillation code here. The same script that produces DistilBERT can be used for GPT-2, though it's not documented.

You should be able to use this command after processing your dataset:

# teacher_name: gpt2, or your own teacher model
# data_file: your data path
# token_counts: your own pickle file path
# force: overwrites the `dump_path` if it already exists
python train.py \
    --student_type gpt2 \
    --student_config training_configs/distilgpt2.json \
    --teacher_type gpt2 \
    --teacher_name gpt2 \
    --alpha_ce 5.0 --alpha_cos 1.0 --alpha_clm 0.5 \
    --freeze_pos_embs \
    --dump_path serialization_dir/my_first_training \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts data/token_counts.bert-base-uncased.pickle \
    --force
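Since both student and teacher are GPT-2 here, the dataset would presumably be binarized with the GPT-2 tokenizer rather than BERT's. A sketch, assuming `scripts/binarized_data.py` from the distillation examples accepts `gpt2` as a tokenizer type and that `data/my_text.txt` stands in for your own preprocessed file:

python scripts/binarized_data.py \
    --file_path data/my_text.txt \
    --tokenizer_type gpt2 \
    --tokenizer_name gpt2 \
    --dump_file data/binarized_text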

pinging @julien-c for more info


Thanks for the reply. I ran into a problem while running the DistilBERT binarization. Can anyone tell me whether this warning means a line of my data exceeds the maximum sequence length the DistilBERT model expects?

!python scripts/binarized_data.py \
--file_path data/dataemail.txt \
--tokenizer_type bert \
--tokenizer_name bert-base-uncased \
--dump_file data/binarized_text

07/19/2020 07:25:56 - WARNING - transformers.tokenization_utils_base - Token indices sequence length is longer than the specified maximum sequence length for this model (626 > 512). Running this sequence through the model will result in indexing errors

@abhilashpal, that's what it looks like to me. I don't have any direct experience with the distillation examples, but I took a quick look at the DistilBERT and BERT papers, and 512 is their maximum token length.
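If the overflow comes from a few long lines, one workaround is to split them into chunks under the limit before binarizing. A minimal sketch, assuming one document per line, and leaving a small margin for any special tokens the binarization script adds (this chunking script is my own, not part of the examples):

from transformers import BertTokenizer

# Leave room below 512 for special tokens added during binarization.
MAX_LEN = 510

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with open("data/dataemail.txt") as fin, open("data/dataemail_chunked.txt", "w") as fout:
    for line in fin:
        ids = tokenizer.encode(line.strip(), add_special_tokens=False)
        # Emit each run of at most MAX_LEN token IDs as its own line.
        for start in range(0, len(ids), MAX_LEN):
            fout.write(tokenizer.decode(ids[start:start + MAX_LEN]) + "\n")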

I ran the previous command mentioned in the README and it did output the binarized file. However, another file is missing: token_counts.distilgpt2.pickle. How do I generate that file as well?
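For reference, I am adapting the token counts step from the README, something like the following (the exact paths and the GPT-2 vocab size of 50257 are my guesses):

python scripts/token_counts.py \
    --data_file data/binarized_text.pickle \
    --token_counts_dump data/token_counts.distilgpt2.pickle \
    --vocab_size 50257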

When I run the token counts script, it returns `IndexError: list assignment index out of range`. Does anyone know how to resolve this error?
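A guess at the cause: counting scripts like this typically build a fixed-size list from `--vocab_size` and then assign into it by token ID, so the error appears when the binarized data contains a token ID at or above the vocab size passed in (for example, GPT-2 token IDs counted against BERT's 30522). A minimal sketch reproducing it (the example IDs are made up):

from collections import Counter

# Hypothetical token IDs from a GPT-2 tokenizer (IDs go up to 50256).
token_ids = [15496, 50256, 318]
counter = Counter(token_ids)

vocab_size = 30522  # BERT's vocab size: too small for GPT-2 token IDs
counts = [0] * vocab_size
try:
    for token_id, count in counter.items():
        counts[token_id] = count
except IndexError as e:
    print(e)  # list assignment index out of range

If that is what is happening here, passing a `--vocab_size` that matches the tokenizer used for binarization (50257 for GPT-2, 30522 for bert-base-uncased) should resolve it.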