Thanks, that worked!
On a tangent, I got the following error about a day into streaming FineWeb:
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(1398776 bytes read, 3882472 more expected)', IncompleteRead(1398776 bytes read, 3882472 more expected))
I looked at the FineWeb repo, and it doesn't look like it uses a custom loader the way RedPajama does (Error Handling in IterableDataset?). Could I use the DataLoader to handle errors when loading data, or is there a better way of doing that? I tried looking through the source code for Datasets, but I don't understand it well enough at the moment.
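For context, here is a rough sketch of the kind of workaround I have in mind, in case it helps frame the question. It is not taken from Datasets or the FineWeb repo. The names `resilient_iter` and `make_iterable` are made up for illustration, and it assumes the stream can be reopened at a given offset (for example via `IterableDataset.skip`), which may mean re-reading earlier data after each failure.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def resilient_iter(
    make_iterable: Callable[[int], Iterable[T]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_retries: int = 5,
    backoff_s: float = 1.0,
) -> Iterator[T]:
    """Yield items from a restartable stream, reopening it on transient errors.

    make_iterable(n) is a hypothetical factory that must return the stream
    with its first n items already skipped.
    """
    seen = 0      # items handed to the caller so far
    failures = 0  # consecutive failures since the last successful item
    while True:
        try:
            for item in make_iterable(seen):
                seen += 1
                failures = 0
                yield item
            return  # stream finished normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the original error
            # Exponential backoff before reopening the stream.
            time.sleep(backoff_s * 2 ** (failures - 1))


# Possible usage with a streaming dataset (untested assumption on my side):
#
#   import requests
#   from datasets import load_dataset
#
#   def make_stream(skip):
#       ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
#       return ds.skip(skip)
#
#   stream = resilient_iter(
#       make_stream,
#       retry_on=(requests.exceptions.ChunkedEncodingError,),
#   )
```

The idea is to do the retrying at the iterator level rather than inside the DataLoader, so the training loop only ever sees a plain iterator. I am not sure whether this plays well with multiple DataLoader workers or with shuffling.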