I’m working on a research project focused on machine translation between Polish and English, and I want to train a Transformer model from scratch to better understand the full process.
I’ve already experimented with the T5 architecture and used the OpenSubtitles parallel dataset (Polish ↔ English) for training. However, I’m running into some challenges:
- It’s unclear whether I should use one shared tokenizer for both languages or build separate vocabularies.
- When training from scratch, I’m not sure about the optimal preprocessing steps (tokenization, padding, truncation) for mixed-language data; a rough sketch of my current pipeline is included after this list.
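For context, here is a minimal sketch of what I have been trying: one shared SentencePiece vocabulary trained on both sides of the corpus, then a preprocessing function that tokenizes and truncates, leaving padding to the collator. The file name, the column names (`pl`, `en`), and the sizes are just placeholders from my setup, not choices I’m confident about.

```python
import sentencepiece as spm
from transformers import T5Tokenizer, DataCollatorForSeq2Seq

# Train ONE shared SentencePiece model on a file that mixes Polish and
# English sentences, one sentence per line (file name is a placeholder).
spm.SentencePieceTrainer.train(
    input="opensubtitles.pl-en.mixed.txt",
    model_prefix="spm_pl_en",
    vocab_size=32_000,
    model_type="unigram",      # T5 uses a unigram SentencePiece model
    character_coverage=1.0,    # keep Polish diacritics (ą, ę, ż, ...)
    pad_id=0, eos_id=1, unk_id=2, bos_id=-1,  # match T5's special-token ids
)

# Wrap the trained SentencePiece model as a T5 tokenizer.
# extra_ids=0: no span-corruption sentinel tokens, keeps the vocab at 32k.
tokenizer = T5Tokenizer("spm_pl_en.model", extra_ids=0)

MAX_LEN = 128  # guess based on typical subtitle line lengths

def preprocess(batch):
    """Tokenize one direction (pl -> en); 'pl'/'en' are my column names."""
    inputs = ["translate Polish to English: " + s for s in batch["pl"]]
    model_inputs = tokenizer(inputs, max_length=MAX_LEN, truncation=True)
    labels = tokenizer(text_target=batch["en"], max_length=MAX_LEN, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Dynamic padding per batch instead of padding everything to MAX_LEN.
collator = DataCollatorForSeq2Seq(tokenizer)
```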
My questions:
- What are the recommended steps to train a bilingual T5 model from scratch for translation tasks?
- How should I handle tokenization and vocabulary sharing between Polish and English?
- Are there training strategies or hyperparameters (e.g., learning rate, batch size, sequence length) known to help stabilize training on small or mid-sized datasets like OpenSubtitles? The rough configuration I’ve been experimenting with is shown after this list.
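For reference, this is roughly the configuration I’ve been experimenting with. Every size and hyperparameter below is a guess rather than a value I trust, which is exactly what I’m hoping to get feedback on.

```python
from transformers import (
    T5Config,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

# A deliberately small T5 so training from scratch on OpenSubtitles-sized
# data stays tractable; all values are guesses, not recommendations.
config = T5Config(
    vocab_size=32_000,     # must match the shared SentencePiece vocab
    d_model=512,
    d_ff=2048,
    num_layers=6,
    num_decoder_layers=6,
    num_heads=8,
)
model = T5ForConditionalGeneration(config)

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-pl-en-scratch",
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,   # simulate a larger effective batch
    warmup_steps=4000,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_steps=500,
)

# The tokenized datasets would come from the preprocess() sketch above:
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train,
#     eval_dataset=tokenized_valid,
#     data_collator=collator,
# )
# trainer.train()
```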
Any advice, code snippets, or examples would be appreciated!