How to efficiently convert a large parallel corpus to a Hugging Face dataset to train an EncoderDecoderModel?

Note that you can also load your dataset in streaming mode by passing streaming=True to load_dataset. You can use the same map functions you used already, but everything will be computed on the fly, similar to a PyTorch DataPipe.
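Here is a minimal sketch of that approach. The file names and the tokenizer checkpoint are placeholders, and it assumes the corpus is stored as line-delimited text; adapt them to your setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical checkpoint; use the tokenizer matching your encoder/decoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# streaming=True returns an IterableDataset: nothing is written to disk
# up front, and examples are yielded one by one as you iterate.
dataset = load_dataset(
    "text",
    data_files={"train": "train.src"},  # hypothetical path to your corpus
    split="train",
    streaming=True,
)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=128)

# Same map() API as a regular Dataset, but applied lazily on the fly.
tokenized = dataset.map(tokenize)

# Iterate lazily; an IterableDataset also plugs into a torch DataLoader.
for example in tokenized:
    print(example["input_ids"][:10])
    break
```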

This will save you a lot of time and disk space :wink:
