How to save the tokenizer after fine-tuning the distilgpt2 model

I am fine-tuning the distilgpt2 model on my own dataset using causal language modeling (CLM).

cmd = '''
python transformers/examples/tensorflow/language-modeling/run_clm.py \
    --model_name_or_path distilgpt2 \
    --train_file {0} \
    --do_train \
    --num_train_epochs 3 \
    --overwrite_output_dir \
    --per_device_train_batch_size 2 \
    --output_dir {1}
'''.format(file_name, weights_dir)

The problem is that training runs successfully, but partway through this warning shows up: Could not locate tokenizer configuration file, will try to use model config file.

The model and config.json are saved, but the tokenizer files are not, so when I try to load the model it reports that the tokenizer cannot be found. Can anyone help me fine-tune the distilgpt2 model?

Hello :wave:

Looking at the script, you can pass

tokenizer_name: Optional[str] = field(
    default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)

in the arguments. Otherwise it will use distilgpt2's own tokenizer config. Let me know if this isn’t helpful.
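For example, the command above could be extended with a --tokenizer_name flag (a sketch based on your command; HfArgumentParser exposes the tokenizer_name field under that name):

cmd = '''
python transformers/examples/tensorflow/language-modeling/run_clm.py \
    --model_name_or_path distilgpt2 \
    --tokenizer_name distilgpt2 \
    --train_file {0} \
    --do_train \
    --num_train_epochs 3 \
    --overwrite_output_dir \
    --per_device_train_batch_size 2 \
    --output_dir {1}
'''.format(file_name, weights_dir)

Another workaround is to save a copy of the tokenizer into the output directory yourself, so the fine-tuned model loads cleanly later (a sketch, assuming weights_dir from your snippet):

from transformers import AutoTokenizer

# distilgpt2 ships its own tokenizer; saving a copy next to the fine-tuned weights is enough
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.save_pretrained(weights_dir)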

Hey @merve, thanks for the help! I tried the PyTorch-based script and it works! I will also try your approach with the TensorFlow script. Also, can you help me with text generation? When I use the trained model, it shows <|endoftext|> after every result. How do I remove it? I know it is just a marker telling the model to stop after that point.
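One common way to strip it from the output is to decode with skip_special_tokens=True. A minimal sketch, assuming the fine-tuned model and tokenizer were saved to weights_dir as above:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(weights_dir)
tokenizer = AutoTokenizer.from_pretrained(weights_dir)

inputs = tokenizer("Once upon a time", return_tensors="pt")
# pad_token_id is set to the eos token to silence the padding warning for GPT-2-style models
output_ids = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

# skip_special_tokens=True drops <|endoftext|> (and any other special tokens) from the decoded text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))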