How to efficiently convert a large parallel corpus to a Hugging Face dataset to train an EncoderDecoderModel?

Note that you can also load your dataset in streaming mode by passing streaming=True to load_dataset. You can use the same map functions you used already, but everything will be computed on the fly, similar to a PyTorch DataPipe.
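Here is a minimal sketch of that approach. The file names and the tokenizer checkpoint are placeholders, and it assumes the corpus is stored as line-delimited text; adapt them to your setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical checkpoint; use the tokenizer matching your encoder/decoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# streaming=True returns an IterableDataset: nothing is written to disk
# up front, and examples are yielded one by one as you iterate.
dataset = load_dataset(
    "text",
    data_files={"train": "train.src"},  # hypothetical path to your corpus
    split="train",
    streaming=True,
)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=128)

# Same map() API as a regular Dataset, but applied lazily on the fly.
tokenized = dataset.map(tokenize)

# Iterate lazily; an IterableDataset also plugs into a torch DataLoader.
for example in tokenized:
    print(example["input_ids"][:10])
    break
```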

This will save you a lot of time and disk space :wink:
