Training T5 on the MLM task from scratch

The main reference I used is here. I need to train T5 from Hugging Face from scratch on the MLM task using PyTorch. To my knowledge, there is no example of how to do that. The main issue is that the same dataset preprocessing with the same T5 model, but with two different frameworks (Flax and PyTorch), gave me different results. I did not change anything in the original code; I only switched to PyTorch and Trainer. Everything else is as in the original code, so why am I getting different results? I need the PyTorch version because I have already built my own model based on the Hugging Face T5, and I need to train it on the MLM task and compare it with the Hugging Face T5. That is why I started with T5 as a baseline.
As a first step, I decided to use the wikitext-103-raw-v1 dataset for pretraining.
The first question on my mind was which tokenizer to use, so I first pretrained with the t5-small tokenizer using the original script, and then trained a tokenizer on the train split of the wikitext-103-raw-v1 dataset.

  1. First issue: the tokenizer trained on the wikitext-103-raw-v1 dataset gave me better results, which raises another question. If I need to pretrain the model on the MLM task and then fine-tune it on another task, which tokenizer should I use? Do I need to retrain the tokenizer every time I use a new dataset? Or simply use the t5-small tokenizer everywhere? Or decide which datasets will be used in my experiments, train the tokenizer on all of their train splits, and then do the pretraining and fine-tuning?
  2. Second issue: mimicking the original script with the PyTorch Trainer, keeping the dataset preprocessing and collator class unchanged, gave unsatisfactory results. Even when I trained for 100 epochs, 10 epochs with the original script still gave better results. Can you please guide me to the reason? I do not need the Flax version; I need a PyTorch pipeline to train T5 on the MLM task from scratch. It seems my attempt was not good.
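To make the objective concrete, here is a minimal sketch of the span-corruption format that T5-style MLM pretraining uses. It is a simplified illustration, not the collator class from the original script: the helper name `span_corrupt` and the hand-picked spans are assumptions for this example, and the real collator samples the spans randomly and works on token IDs rather than words.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    Returns (inputs, targets) as lists of string tokens.
    Hypothetical helper for illustration only.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Keep the uncorrupted text before the span, then mark the gap.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target repeats the sentinel followed by the dropped tokens.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A closing sentinel marks the end of the targets in this sketch.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The loss reported during pretraining is the cross-entropy over the target sequence only, so anything that changes how the targets are tokenized also changes the loss value.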

This is my attempt to run the same steps using PyTorch.
I tried the t5-small tokenizer. I also trained the tokenizer provided in this repo on wikitext for comparison.

The results are not the same, which seems strange. Training for 10 epochs with the original script using:

  1. With the tokenizer trained on wiki:
    export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 --output_dir="./MLM-128wiki/wikitokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="./wikitext-103-raw-v1" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --overwrite_output_dir --logging_steps="500" --save_steps="10000" --eval_steps="500" --num_train_epochs=10

  2. With the t5-small tokenizer:
    export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 --output_dir="./MLM-128wiki/t5-tokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="t5-small" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --overwrite_output_dir --logging_steps="500" --save_steps="10000" --eval_steps="500" --num_train_epochs=10


              T5 tokenizer    tokenizer trained on wiki
train loss    2.307           2.074
eval loss     2.254           1.959

Using my code as follows:

  1. With the tokenizer trained on wiki:
    export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 --output_dir="./torch/wiki" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="./wikitext-103-raw-v1" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --logging_steps="500" --save_steps="10000" --eval_steps="1000" --do_train --do_eval --do_predict --overwrite_output_dir --report_to='wandb' --num_train_epochs=10 --evaluation_strategy steps

  2. With the t5-small tokenizer:
    export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 --output_dir="./torch/t5tokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="t5-small" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --logging_steps="500" --save_steps="10000" --eval_steps="1000" --do_train --do_eval --do_predict --overwrite_output_dir --report_to='wandb' --num_train_epochs=10 --evaluation_strategy steps


              T5 tokenizer    tokenizer trained on wiki
train loss    4.675           3.961
eval loss     4.562           3.8
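To put the gap in perspective, the eval losses reported above can be converted to per-token perplexity. This is a rough sketch that assumes both scripts report the mean natural-log cross-entropy per target token; the dictionary layout and variable names are just for this example.

```python
import math

# Eval losses copied from the two result tables in this post.
eval_loss = {
    ("flax", "t5-small tokenizer"): 2.254,
    ("flax", "wiki tokenizer"): 1.959,
    ("pytorch", "t5-small tokenizer"): 4.562,
    ("pytorch", "wiki tokenizer"): 3.8,
}

# Perplexity is exp(loss) when the loss is a mean cross-entropy in nats.
perplexity = {key: math.exp(loss) for key, loss in eval_loss.items()}

for (framework, tokenizer), ppl in perplexity.items():
    print(f"{framework:8s} {tokenizer:20s} perplexity = {ppl:.1f}")

# Ratio of PyTorch to Flax perplexity for the same tokenizer.
ratio = {
    tok: perplexity[("pytorch", tok)] / perplexity[("flax", tok)]
    for tok in ("t5-small tokenizer", "wiki tokenizer")
}
for tok, r in ratio.items():
    print(f"{tok:20s} pytorch/flax perplexity ratio = {r:.1f}")
```

Comparing the two frameworks with the same tokenizer is meaningful here. Comparing the two tokenizers against each other is less so, because different vocabularies split the same text into different numbers of tokens, so per-token loss is not measured on the same units.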


Could you please share with us if you find anything related to this issue?

Hi sadra, from my experiments I found that a trained tokenizer always gives better results than the T5 tokenizer for a specific task.


The main difference I see is that you’re using Adafactor in the JAX script, but the default optimizer (AdamW) in the PyTorch script.

I am talking about a comparison using the same optimizer; in both experiments above I used Adafactor.