+ huge dataset

How can I use an iterator to run the `transformers/examples/pytorch/summarization` example with huge data files (JSON Lines format, size > 20 GB) without loading the whole file into memory? Any suggestions on how to use the script with large JSON files?


Transformers 4.20.1

Pytorch 1.11.0+cu113

Datasets 2.3.2

Tokenizers 0.12.1

I would split the JSON file into 2 or 3 parts and do the training in 2 or 3 runs. You can also play with the batch size via `--per_device_train_batch_size`.
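The splitting step above can be sketched with the standard library alone; this is a hypothetical helper (names are made up), which distributes lines round-robin into shards so no shard has to fit in memory at any point:

```python
# Sketch: split a JSON Lines file into N shard files without loading it
# into memory. Function and file names are illustrative.
import json


def split_jsonl(path, n_parts):
    """Distribute lines from `path` round-robin into `path`.part0..partN-1."""
    outs = [open(f"{path}.part{i}", "w") for i in range(n_parts)]
    try:
        with open(path) as f:
            for i, line in enumerate(f):
                outs[i % n_parts].write(line)
    finally:
        for o in outs:
            o.close()


# Tiny demo file standing in for the huge train file.
with open("big.jsonl", "w") as f:
    for i in range(10):
        f.write(json.dumps({"id": i}) + "\n")

split_jsonl("big.jsonl", 3)
print(sum(1 for _ in open("big.jsonl.part0")))  # → 4  (lines with ids 0, 3, 6, 9)
```

Each shard can then be passed to the script in its own run via `--train_file big.jsonl.part0`, and so on.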