+ huge dataset

How can I use an iterator to run the `transformers/examples/pytorch/summarization` example with huge data files (JSON Lines format, size > 20 GB) without loading the whole file into memory? Any suggestions on how to use the script with large JSON files?


Transformers 4.20.1

Pytorch 1.11.0+cu113

Datasets 2.3.2

Tokenizers 0.12.1

I would split the JSON file into 2 or 3 parts and do the training in 2 or 3 runs. You can also play with the batch size via `--per_device_train_batch_size`.
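The splitting step above can be sketched with the standard library alone; this is a hypothetical helper (names are made up), which distributes lines round-robin into shards so no shard has to fit in memory at any point:

```python
# Sketch: split a JSON Lines file into N shard files without loading it
# into memory. Function and file names are illustrative.
import json


def split_jsonl(path, n_parts):
    """Distribute lines from `path` round-robin into `path`.part0..partN-1."""
    outs = [open(f"{path}.part{i}", "w") for i in range(n_parts)]
    try:
        with open(path) as f:
            for i, line in enumerate(f):
                outs[i % n_parts].write(line)
    finally:
        for o in outs:
            o.close()


# Tiny demo file standing in for the huge train file.
with open("big.jsonl", "w") as f:
    for i in range(10):
        f.write(json.dumps({"id": i}) + "\n")

split_jsonl("big.jsonl", 3)
print(sum(1 for _ in open("big.jsonl.part0")))  # → 4  (lines with ids 0, 3, 6, 9)
```

Each shard can then be passed to the script in its own run via `--train_file big.jsonl.part0`, and so on.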