I need help running my code on the MLM task

I am trying to train T5 with two layers in both the encoder and the decoder (2 encoder layers, 2 decoder layers) on the MLM task. I have already built my own model based on transformers.T5ForConditionalGeneration, which is why I need to train this version on MLM in PyTorch rather than use the JAX model as here.
That is why I tried to mimic that code in pure torch, and this is the link to my code. I have many questions:

  • Using the t5-small tokenizer I get this error: /opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [117,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.

  • I understand that the pre-trained T5 models differ in the number of layers and, consequently, in the number of parameters. But what is the difference between the pre-trained tokenizers? All models are pre-trained on C4, so if the tokenizer is also trained on the C4 corpus, why is it loaded under different names? Is the pre-trained tokenizer the same for all models, so that when we load it we only refer to the config of the pre-trained model, which in turn points to the same tokenizer?
    I have tried the three tokenizers (small, base, large) on small samples of text and did not notice any difference. Comparing the vocabularies of the three tokenizers, I found that they are identical.
    Another question, and please correct me if I am wrong. As far as I know, the tokenizer and the data distribution go hand in hand when training any model. If I want to pre-train a T5 model with a different number of layers on masked language modeling (say, on any English text dataset from Hugging Face), do I need to train a tokenizer on this corpus, or is it enough to use the pre-trained T5 tokenizer?

  • I tried to train the tokenizer as here on the wikitext train split (code in the src folder). With it I do not get the indexing error above, but the run freezes after launching and I do not know why, since nothing is printed. You can find my code here.
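
For the indexing error in the first bullet, here is a minimal sketch of a check that can be run on a batch before it reaches the GPU. It assumes the assertion comes from an embedding lookup receiving a token id that is not smaller than the embedding table size, which is a common cause of this message but is not confirmed for my code. The function name and the example numbers are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    batch: Iterable[Iterable[int]], num_embeddings: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, token_id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, ids in enumerate(batch):
        for col, token_id in enumerate(ids):
            if token_id < 0 or token_id >= num_embeddings:
                bad.append((row, col, token_id))
    return bad


# Hypothetical numbers for illustration only: an embedding table with 100 rows
# and a batch in which one id (100) is too large.
example_batch = [[5, 17, 99], [3, 100, 1]]
print(find_out_of_range_ids(example_batch, num_embeddings=100))
```

With a real model, `batch` could be `input_ids.tolist()` and `num_embeddings` could be read from `model.get_input_embeddings().num_embeddings`. Comparing that number with `len(tokenizer)` would show whether the tokenizer can produce ids the model has no embedding row for.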

Also, with different tokenizers I get different numbers of training examples (maybe because the number of resulting tokens differs depending on the tokenizer).
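
To illustrate why I suspect the tokenizer, here is a small sketch. It assumes the preprocessing concatenates all tokenized texts and cuts them into fixed-length blocks, dropping the incomplete tail, as MLM preprocessing scripts commonly do. The function name and the token counts are made up for illustration.

```python
from typing import Sequence


def count_examples(tokens_per_document: Sequence[int], block_size: int) -> int:
    """Number of full blocks after concatenating all documents' tokens."""
    total_tokens = sum(tokens_per_document)
    return total_tokens // block_size  # the incomplete tail is dropped


# Hypothetical token counts for the same three texts under two tokenizers.
counts_tokenizer_a = [120, 300, 95]   # 515 tokens in total
counts_tokenizer_b = [180, 400, 130]  # 710 tokens in total

print(count_examples(counts_tokenizer_a, block_size=128))  # 4
print(count_examples(counts_tokenizer_b, block_size=128))  # 5
```

Under this assumption, the same texts give a different number of examples as soon as one tokenizer splits them into more tokens than the other.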