I am working on a model that translates from English into a language for which very little translated data is available. My plan is to pretrain the model and then fine-tune it on the translation task. I read the “Attention Is All You Need” paper and concluded that the authors don’t use pretraining, which seems necessary when data is scarce. Do you know of any papers, articles, or other resources that would help me learn more about this topic? Feel free to give as many suggestions as possible.