ByT5 data preparation and pretraining

mondonomo · November 11, 2022, 1:01pm

I’m trying to pre-train the ByT5 model for a specific domain (organisation names). My main goal is a model that can be fine-tuned for semantic search (using the encoder only), language identification, organisation type classification and token classification. I prepared a training dataset of 200 million examples, each with an organisation name, its language(s) and its type (mostly from Wikidata and similar sources). The maximum input length is 100 bytes (average 44), and the number of classes is 1431.

It would be great if somebody could advise me on the best practices, specifically:

Should I repeat the denoising task if I continue training from the published model?
As I plan to use the encoder for other tasks, as in the Sentence-T5 paper ([2108.08877] Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models), should I put the classification prefix on the labels only (e.g. “class: company en_Latn”) and keep the inputs clean, without any prefix?
Should I start training with the base or the small model? I’m training on a single V100, and one epoch (classification) takes about 60 hours for the small model.
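
To make the question about label prefixes concrete, here is a minimal sketch of the example format I have in mind: the input is the raw organisation name with no prefix, and the target carries the “class: …” prefix. It assumes the usual ByT5 convention that ids 0, 1 and 2 are reserved for padding, end-of-sequence and unknown, so each UTF-8 byte `b` maps to id `b + 3`. The helper names (`encode_bytes`, `make_example`) and the sample organisation are hypothetical, for illustration only.

```python
# Sketch only: pure-Python stand-in for ByT5-style byte encoding.
# Assumption: ids 0/1/2 are <pad>, </s>, <unk>; byte b maps to id b + 3.
PAD_ID, EOS_ID = 0, 1
BYTE_OFFSET = 3
MAX_INPUT_BYTES = 100  # maximum input length in my dataset


def encode_bytes(text, max_bytes=None):
    """Encode text as byte ids followed by the end-of-sequence id."""
    data = text.encode("utf-8")
    if max_bytes is not None:
        # Note: cutting on a byte boundary can split a multi-byte character.
        data = data[:max_bytes]
    return [b + BYTE_OFFSET for b in data] + [EOS_ID]


def make_example(name, org_type, lang):
    """Unprefixed name as input, prefixed class string as target."""
    return {
        "input_ids": encode_bytes(name, MAX_INPUT_BYTES),
        "labels": encode_bytes(f"class: {org_type} {lang}"),
    }


# Hypothetical sample record, not taken from the real dataset.
example = make_example("Škoda Auto", "company", "cs_Latn")
print(len(example["input_ids"]), len(example["labels"]))
```

With this layout the encoder only ever sees clean names, which is what I would want if I later reuse it on its own for semantic search.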

Thanks
