How can I train a Polish-English translation Transformer model from scratch using PyTorch or Hugging Face?

I’m working on a research project focused on machine translation between Polish and English, and I want to train a Transformer model from scratch to better understand the full process.

I’ve already experimented with the T5 architecture and used the OpenSubtitles parallel dataset (Polish ↔ English) for training; a rough sketch of my current pipeline is included right after the list below. However, I’m running into some challenges:

  • It’s unclear whether I should use one shared tokenizer for both languages or build separate vocabularies.

  • When training from scratch, I’m not sure about the optimal preprocessing steps (tokenization, padding, truncation) for the parallel Polish-English data.
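For reference, here’s roughly what my current pipeline looks like. This is a minimal sketch rather than my exact code: the dataset id, column layout, vocab size, task prefix, and max length are placeholders/assumptions on my side, and EOS handling is left out.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# OPUS OpenSubtitles pl-en pairs from the Hub (depending on the `datasets`
# version, script-based datasets may need trust_remote_code=True)
raw = load_dataset("Helsinki-NLP/open_subtitles", lang1="en", lang2="pl", split="train")

# Yield BOTH sides of every pair so a single tokenizer sees Polish and
# English and the two languages share one vocabulary / embedding matrix
def text_iterator(batch_size=1000):
    for i in range(0, len(raw), batch_size):
        for pair in raw[i : i + batch_size]["translation"]:
            yield pair["pl"]
            yield pair["en"]

# SentencePiece-style Unigram tokenizer, roughly T5-like
backend = Tokenizer(models.Unigram())
backend.normalizer = normalizers.NFKC()
backend.pre_tokenizer = pre_tokenizers.Metaspace()
backend.decoder = decoders.Metaspace()
trainer = trainers.UnigramTrainer(
    vocab_size=32_000,  # placeholder size
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)
backend.train_from_iterator(text_iterator(), trainer=trainer)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    pad_token="<pad>",
    eos_token="</s>",
    unk_token="<unk>",
)

# Preprocessing: T5-style task prefix, truncate long subtitle lines, and
# no padding here (the data collator pads each batch dynamically instead)
MAX_LEN = 128
PREFIX = "translate Polish to English: "  # pl -> en direction

def preprocess(examples):
    sources = [PREFIX + pair["pl"] for pair in examples["translation"]]
    targets = [pair["en"] for pair in examples["translation"]]
    model_inputs = tokenizer(sources, max_length=MAX_LEN, truncation=True)
    labels = tokenizer(text_target=targets, max_length=MAX_LEN, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)
```

My thinking was that one shared vocabulary keeps a single tied embedding matrix, which is what multilingual T5-style models do, but I’m not sure it’s the right call for just two languages.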

My questions:

  1. What are the recommended steps to train a bilingual T5 model from scratch for translation tasks?

  2. How should I handle tokenization and vocabulary sharing between Polish and English?

  3. Are there training strategies or hyperparameters (e.g., learning rate, batch size, sequence length) known to help stabilize training on small or mid-sized datasets like OpenSubtitles? (My current training setup is sketched right after these questions.)
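For context on question 3, this is the kind of from-scratch setup I’ve been experimenting with, continuing from the pipeline sketch above. Again, just a sketch: the model dimensions, learning rate, schedule, and step counts are my own guesses rather than values I’m claiming are correct.

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5Config,
    T5ForConditionalGeneration,
)

# Small T5 initialized with random weights (no pretrained checkpoint),
# sized roughly like t5-small and tied to the shared tokenizer above
config = T5Config(
    vocab_size=len(tokenizer),
    d_model=512,
    d_ff=2048,
    num_layers=6,
    num_decoder_layers=6,
    num_heads=8,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=tokenizer.pad_token_id,  # T5 starts decoding from <pad>
)
model = T5ForConditionalGeneration(config)

# Dynamic per-batch padding; -100 so padded label positions are ignored by the loss
collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)

args = Seq2SeqTrainingArguments(
    output_dir="t5-pl-en-from-scratch",
    per_device_train_batch_size=64,      # as large as memory allows
    gradient_accumulation_steps=2,
    learning_rate=5e-4,                  # higher than typical fine-tuning LRs
    warmup_steps=4000,
    lr_scheduler_type="inverse_sqrt",    # needs a recent transformers; "linear" also works
    weight_decay=0.01,
    max_steps=200_000,
    logging_steps=100,
    save_steps=5_000,
    fp16=True,                           # if training on a GPU with mixed precision
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()
```

In particular I’d be curious whether the warmup/schedule choice or the effective batch size matters most for stability here.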

Any advice, code snippets, or examples would be appreciated!
