Training T5 on mlm task from scratch

Arij · December 7, 2021, 4:00pm

My main reference is here. I need to train Hugging Face's T5 from scratch on the MLM task using PyTorch. To my knowledge, there is no example showing how to do that. The main issue is that the same dataset preprocessing with the same T5 model gives me different results under the two frameworks, Flax and PyTorch. I did not change anything in the original run_t5_mlm_flax.py code; I only switched to PyTorch and Trainer. Everything else is as in the original code, so why am I getting different results? I need the PyTorch version because I have already built my own model on top of Hugging Face's T5, and I need to train it on the MLM task and compare it with the Hugging Face T5. That is why I started with T5 as a baseline.
As a first step, I decided to use the wikitext-103-raw-v1 dataset for pretraining.
The first question on my mind was which tokenizer to use, so I first pretrained with the t5-small tokenizer using the original script, and then trained a tokenizer on the train split of wikitext-103-raw-v1.

First issue: the tokenizer trained on wikitext-103-raw-v1 gave me better results, which raises another question. If I pretrain the model on the MLM task and then fine-tune it on another task, which tokenizer should I use? Do I need to retrain the tokenizer every time I use a new dataset? Should I simply use the t5-small tokenizer everywhere? Or should I decide which datasets will be used in my experiments, train the tokenizer on all of their train splits, and then do the pretraining and fine-tuning?
Second issue: mimicking run_t5_mlm_flax.py with the PyTorch Trainer, keeping the dataset preprocessing and the collator class unchanged, gave unsatisfactory results. Even when I trained for 100 epochs, 10 epochs with the original script still gave better results. Can you please guide me to the reason? I do not need the Flax version; I need a PyTorch pipeline to train T5 on the MLM task from scratch. It seems my attempt was not good.
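For readers unfamiliar with what "MLM" means for T5, the objective is span corruption: spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it replaced. The sketch below illustrates only that input/target construction. It is a simplification under stated assumptions, not the collator from the script: it works on word strings instead of token ids, takes the spans as a fixed argument instead of sampling them, and does not append an end-of-sequence token. The function name `span_corrupt` is hypothetical.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str],
    noise_spans: Sequence[Tuple[int, int]],
) -> Tuple[List[str], List[str]]:
    """Build (inputs, targets) for T5-style span corruption.

    Spans are half-open [start, end) index pairs, assumed sorted and
    non-overlapping (this sketch does not validate them).
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(noise_spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep whatever follows the last span
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

When porting a collator between frameworks, printing one decoded input/target pair from each pipeline in this form is a cheap way to check that both produce the same kind of example.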

This is my attempt to run the same steps using PyTorch.
I tried the t5-small tokenizer. I also trained the tokenizer given in this repo on wikitext, for comparison.

The results are not the same, which seems strange. Training for 10 epochs with the original Flax script:

If the tokenizer is trained on wiki:
export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 run_t5_mlm_flax.py --output_dir="./MLM-128wiki/wikitokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="./wikitext-103-raw-v1" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --overwrite_output_dir --logging_steps="500" --save_steps="10000" --eval_steps="500" --num_train_epochs=10
If the tokenizer is the t5-small tokenizer:
export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 run_t5_mlm_flax.py --output_dir="./MLM-128wiki/t5-tokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="t5-small" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --overwrite_output_dir --logging_steps="500" --save_steps="10000" --eval_steps="500" --num_train_epochs=10

Results (Flax script):

              T5 tokenizer    tokenizer trained on wiki
train loss    2.307           2.074
eval loss     2.254           1.959

Using my code as follows:
1. If the tokenizer is trained on wiki:

export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 rum_mlm_torch.py --output_dir="./torch/wiki" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="./wikitext-103-raw-v1" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --logging_steps="500" --save_steps="10000" --eval_steps="1000" --do_train --do_eval --do_predict --overwrite_output_dir --report_to='wandb' --num_train_epochs=10 --evaluation_strategy steps

2. If the tokenizer is the t5-small tokenizer:

export CUDA_VISIBLE_DEVICES=0,1,2,3; python3 rum_mlm_torch.py --output_dir="./torch/t5tokenizer" --model_type="t5" --config_name="./wikitext-103-raw-v1" --tokenizer_name="t5-small" --dataset_name="wikitext" --dataset_config_name="wikitext-103-raw-v1" --max_seq_length="128" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --adafactor --learning_rate="0.005" --weight_decay="0.001" --warmup_steps="2000" --logging_steps="500" --save_steps="10000" --eval_steps="1000" --do_train --do_eval --do_predict --overwrite_output_dir --report_to='wandb' --num_train_epochs=10 --evaluation_strategy steps

Results (PyTorch script):

              T5 tokenizer    tokenizer trained on wiki
train loss    4.675           3.961
eval loss     4.562           3.8
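To get a feel for how large the gap between the two pipelines is, the eval losses can be converted to perplexities. This is a small sketch that assumes both scripts report mean cross-entropy in nats per target token; that assumption is not confirmed anywhere in the post. The loss values are copied from the two results tables, and the dictionary layout is only for illustration.

```python
import math

# Eval losses copied from the two results tables in this post.
# Assumption: each value is a mean cross-entropy in nats per target token.
eval_loss = {
    ("flax", "t5-small tokenizer"): 2.254,
    ("flax", "wiki tokenizer"): 1.959,
    ("torch", "t5-small tokenizer"): 4.562,
    ("torch", "wiki tokenizer"): 3.8,
}

# Perplexity is the exponential of the mean cross-entropy.
perplexity = {run: math.exp(loss) for run, loss in eval_loss.items()}

for (framework, tokenizer), ppl in perplexity.items():
    print(f"{framework:5s} {tokenizer:20s} perplexity = {ppl:.1f}")
```

The comparison that matters for the question is Flax versus PyTorch with the same tokenizer. Per-token numbers obtained with different tokenizers are harder to compare directly, because the two vocabularies split the same text into different numbers of tokens.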

sadra · February 8, 2022, 4:17pm

Could you please share with us if you find anything related to this issue?

Arij · July 25, 2022, 12:22pm

Hi sadra, in my experiments I found that a tokenizer trained on the dataset always gives better results than the T5 tokenizer for a specific task.

cmcmaster · July 29, 2022, 2:05am

The main difference I see is that you’re using Adafactor in the JAX script, but the default optimizer (AdamW) in the PyTorch script.

Arij · July 29, 2022, 6:15am

I am talking about a comparison using the same optimizer; in both experiments above I used Adafactor.
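One way to check a claim like this is to diff the flags of the two commands mechanically. The sketch below does that for abbreviated versions of the t5-small commands from the first post (only a subset of the flags is reproduced here). The helpers `parse_flags` and `diff_flags` are hypothetical names written for this example, not part of either script.

```python
import shlex


def parse_flags(command: str) -> dict:
    """Map each --flag to its value; flags without a value map to True."""
    flags = {}
    args = shlex.split(command)  # also strips the shell quotes
    i = 0
    while i < len(args):
        arg = args[i]
        if arg.startswith("--"):
            name, sep, value = arg[2:].partition("=")
            if sep:                      # --name=value
                flags[name] = value
            elif i + 1 < len(args) and not args[i + 1].startswith("--"):
                flags[name] = args[i + 1]  # --name value
                i += 1
            else:                        # bare switch such as --adafactor
                flags[name] = True
        i += 1
    return flags


def diff_flags(a: dict, b: dict) -> dict:
    """Return {flag: (value_in_a, value_in_b)} for every flag that differs."""
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


# Abbreviated commands from the post (subset of flags only).
flax_cmd = ('python3 run_t5_mlm_flax.py --tokenizer_name="t5-small" --adafactor '
            '--learning_rate="0.005" --warmup_steps="2000" --eval_steps="500" '
            '--num_train_epochs=10')
torch_cmd = ('python3 rum_mlm_torch.py --tokenizer_name="t5-small" --adafactor '
             '--learning_rate="0.005" --warmup_steps="2000" --eval_steps="1000" '
             '--do_train --do_eval --num_train_epochs=10 --evaluation_strategy steps')

diff = diff_flags(parse_flags(flax_cmd), parse_flags(torch_cmd))
for name, (flax_value, torch_value) in diff.items():
    print(f"{name}: flax={flax_value!r} torch={torch_value!r}")
```

Matching flags only show that the same options were requested. Each script decides for itself how to build the optimizer and learning-rate schedule from those options, so the construction code in both scripts is still worth comparing side by side.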
