I need help running my code on an MLM task

I am trying to train a T5 with two layers on both the encoder and the decoder (2 encoder layers, 2 decoder layers) on an MLM task. I have already built my own model based on transformers.T5ForConditionalGeneration, which is why I need to train this version on MLM rather than using the JAX model as here.
That is why I tried to mimic that code, but using pure PyTorch; this is the link to my code. I have many questions:

  • Using the t5-small tokenizer I get this error (see the sketch after this list for the check I would run):
    `/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [117,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.`

  • I understand that the difference between the pre-trained T5 models is the number of layers and, consequently, the number of parameters. But what, then, is the difference between the pre-trained tokenizers? I mean, all the models are pre-trained on C4; if the tokenizer is also trained on the C4 corpus, why load the tokenizer under different names? Is the pre-trained tokenizer the same for all models, and when we load it do we simply go through the config of each pre-trained model, which points to the same pre-trained tokenizer?
    Actually, I have tried the three tokenizers (small, base, large) on small samples of text and did not notice any difference. Comparing the vocabularies of the three tokenizers, I found that the vocab is the same for all of them.
    Another question, please, and correct me if I am wrong: to my knowledge, the tokenizer and the data distribution go hand in hand when training any model. If I want to pre-train the T5 model with different numbers of layers on masked language modeling on, let us say, any English text dataset from Hugging Face, do I need to train the tokenizer on that corpus, or is it enough to use the pre-trained T5 tokenizer?

  • I tried to train the tokenizer as here on the wikitext train split (code in the src folder). With that tokenizer I no longer get the indexing error above, but training freezes after launching and I do not know why, since nothing is written to the output. You can find my code here.
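
Regarding the first bullet, here is a minimal check I would run (a sketch, not part of my training script), assuming the assertion comes from token ids that fall outside the model's embedding matrix, which is the usual cause of that indexSelectLargeIndex failure; it also compares the vocabularies of the three pre-trained tokenizers directly:

```python
# Minimal sketch: compare the pre-trained T5 tokenizer vocabularies and check
# that the ids produced by the tokenizer fit inside the model's embedding
# matrix. A token id >= the number of embedding rows is exactly what triggers
# the "indexSelectLargeIndex ... srcIndex < srcSelectDimSize" assertion on CUDA.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizers = {
    name: T5TokenizerFast.from_pretrained(name)
    for name in ("t5-small", "t5-base", "t5-large")
}

# All three checkpoints ship the same SentencePiece vocabulary.
vocabs = [tok.get_vocab() for tok in tokenizers.values()]
print("identical vocabs:", vocabs[0] == vocabs[1] == vocabs[2])

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # stand-in for my 2+2-layer model
tok = tokenizers["t5-small"]

# len(tok) includes added special tokens (e.g. the <extra_id_*> sentinels used
# for the MLM objective); it must not exceed the number of embedding rows.
print("tokenizer size :", len(tok))
print("embedding rows :", model.get_input_embeddings().num_embeddings)

# If the tokenizer is larger than the embedding table, resize the embeddings
# before training; otherwise indexing fails on the GPU.
if len(tok) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tok))
```

The same check would apply to my custom wikitext tokenizer: whenever the tokenizer changes, the vocab size in the model config has to follow it.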

Also, using different tokenizers I got different numbers of examples to train on (maybe because the number of resulting tokens differs according to the tokenizer).
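
To illustrate this last point, a quick sketch of what I think is happening (bert-base-uncased is only a stand-in for a tokenizer trained on a different corpus; in my case it would be the tokenizer I trained on wikitext):

```python
# Different tokenizers split the same text into different numbers of tokens,
# so when the dataset is concatenated and chunked into fixed-length blocks,
# the number of resulting training examples changes with the tokenizer.
from transformers import AutoTokenizer

text = "Masked language modeling with a two-layer T5 on WikiText."

for name in ("t5-small", "bert-base-uncased"):  # bert is just a stand-in tokenizer
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(text).input_ids)
    print(f"{name}: {n_tokens} tokens")

# With examples built as blocks of block_size tokens,
# number_of_examples ≈ total_tokens // block_size,
# which is why the example count moves with the tokenizer.
```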