I’m trying to pre-train the ByT5 model for a specific domain (organisation names). My main goal is to have a model that can be fine-tuned for semantic search (using the encoder only), language identification, organisation type classification and token classification. I prepared a training dataset of 200 million examples, each with the organisation name, languages and type (mostly from Wikidata and similar sources). The maximum length of the inputs is 100 bytes (average 44), and the number of classes is 1431.
It would be great if somebody could advise me on the best practices, specifically:
- Should I repeat the denoising task if I continue training from the published model?
- As I plan to use the encoder for other tasks, as in the paper "Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models" (arXiv:2108.08877), should I put the task prefix on the labels when training classification (e.g. “class: company en_Latn”) and keep the inputs clean, without any prefix?
- Should I start training with the base or the small model? I’m training on a single V100, and one epoch (classification) takes about 60 hours for the small model.
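
To make the second question concrete, here is a minimal sketch of the label-prefix option: the input stays the raw organisation name and the prefix goes on the target side. The `OrgRecord` layout, the helper names and the example record are hypothetical, not the real dataset schema. The byte encoding follows the ByT5 convention as I understand it (UTF-8 bytes shifted by 3, with ids 0, 1 and 2 reserved for padding, end-of-sequence and unknown); it is written in plain Python for illustration rather than with the actual tokenizer.

```python
from dataclasses import dataclass
from typing import List, Tuple

# ByT5 reserves ids 0, 1, 2 for <pad>, </s>, <unk>, so byte b maps to id b + 3.
BYTE_OFFSET = 3
EOS_ID = 1
MAX_INPUT_BYTES = 100  # maximum input length stated in the post


@dataclass
class OrgRecord:
    # Hypothetical record layout; the real dataset fields may differ.
    name: str
    org_type: str
    language: str


def to_text_pair(rec: OrgRecord) -> Tuple[str, str]:
    """Keep the input unprefixed and put the task prefix on the label side."""
    return rec.name, f"class: {rec.org_type} {rec.language}"


def byt5_style_ids(text: str, max_bytes: int = MAX_INPUT_BYTES) -> List[int]:
    """Byte-level ids: UTF-8 bytes shifted by 3, followed by the EOS id.

    Truncation is done on raw bytes, so it can cut a multi-byte character
    in half; a real pipeline may want to truncate on character boundaries.
    """
    raw = text.encode("utf-8")[:max_bytes]
    return [b + BYTE_OFFSET for b in raw] + [EOS_ID]


if __name__ == "__main__":
    rec = OrgRecord(name="Acme GmbH", org_type="company", language="de_Latn")
    source, target = to_text_pair(rec)
    print(source, "->", target)
    print(byt5_style_ids(source))
```

With this layout the encoder only ever sees clean names, which is the property that matters if the encoder is later reused on its own for semantic search. Whether that actually works better than prefixing the inputs is exactly what I am unsure about.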