Note that you can also load your dataset in streaming mode by passing `streaming=True` to `load_dataset`. You can use the same `map` functions you used already, but everything will be computed on the fly, like a torch DataPipe. This will save you a lot of time and disk space, since the dataset doesn't have to be downloaded and processed up front.
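
For illustration, here is a minimal sketch of what that could look like. The helper names (`add_text_length`, `stream_examples`) and the `"text"` column are made up for the example, so swap in your own `map` function and columns:

```python
def add_text_length(example):
    # Hypothetical per-example transform; assumes the dataset has a "text" column.
    example["text_length"] = len(example["text"])
    return example


def stream_examples(dataset_name, split="train", limit=3):
    # Imported here so the transform above stays usable without the library.
    from datasets import load_dataset

    # streaming=True returns an IterableDataset instead of downloading everything.
    dataset = load_dataset(dataset_name, split=split, streaming=True)
    # In streaming mode, map is lazy: it only runs as examples are iterated.
    dataset = dataset.map(add_text_length)
    # take(limit) keeps only the first few examples of the stream.
    return list(dataset.take(limit))
```

Calling something like `stream_examples("rotten_tomatoes")` (any Hub dataset with a `"text"` column should do) fetches only the first few examples and applies the function to them as they arrive.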