Speeding up Streaming of Large Datasets (FineWeb)?

Thanks, that worked!

On a tangent, I got the following error about a day into streaming FineWeb:

requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(1398776 bytes read, 3882472 more expected)', IncompleteRead(1398776 bytes read, 3882472 more expected))

I looked at the FineWeb repo and it doesn’t look like it uses a custom loader the way RedPajamas does (Error Handling in IterableDataset?). Could I use the DataLoader to handle errors when loading data, or is there a better way of doing that? I tried looking through the source code for Datasets, but I don’t understand it well enough atm.
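
For context, the kind of error handling I have in mind is roughly the sketch below. It is not something Datasets provides; `resilient_iter` and `make_iterable` are hypothetical names, and it assumes the stream can be re-created starting after the first `n` items, so a dropped connection can be retried with backoff instead of killing the whole run.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def resilient_iter(
    make_iterable: Callable[[int], Iterable[T]],  # hypothetical factory: stream starting after n items
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_retries: int = 5,
    base_delay: float = 1.0,
) -> Iterator[T]:
    """Yield items from a stream, rebuilding it after a transient error."""
    seen = 0      # items already handed to the consumer
    failures = 0  # consecutive failures since the last successful item
    while True:
        try:
            for item in make_iterable(seen):
                seen += 1
                failures = 0
                yield item
            return  # stream finished normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the original error
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (failures - 1))
```

For the error above, `retry_on` would presumably need to be `(requests.exceptions.ChunkedEncodingError,)`, since that class is not a subclass of the built-in `ConnectionError`. With Datasets, `make_iterable` could be something like `lambda n: load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True).skip(n)`, though I assume skipping a day's worth of examples may be slow because the skipped data might still have to be read.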