I’m trying to stream a dataset (Fineweb) and keep getting intermittent HTTP 500 errors that crash the stream (sorry, no logs; it’s intermittent and I was bad about saving them). Short of modifying the underlying code, is there a way of “powering through” any and all errors?
Note: I don’t care about preserving dataset order or seeing the entire dataset.
The only solution I’ve thought of so far is to save the iterator state when a crash happens and then resume from that point via state_dict: [Resumable IterableDataset] Add IterableDataset state_dict by lhoestq · Pull Request #6658 · huggingface/datasets · GitHub. It’s not as pretty as I would like, but I will try it and report back here if it works.
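Here is a minimal sketch of that save-and-resume idea, not a tested fix. It assumes only that the stream exposes `state_dict()` and `load_state_dict()`, as the linked PR adds for `IterableDataset`. `resilient_iter` and `FlakyStream` are hypothetical names made up for illustration; `FlakyStream` is a stand-in that fails once, so the snippet runs without network access.

```python
import time


def resilient_iter(make_stream, max_retries=5, backoff_s=0.0):
    """Yield examples, rebuilding the stream from the last saved state on errors.

    `make_stream` is a zero-argument factory (hypothetical helper) returning an
    iterable that exposes `state_dict()` and `load_state_dict()`.
    """
    state = None      # last known-good position; None means "start from scratch"
    failures = 0      # consecutive failures since the last successful example
    while True:
        stream = make_stream()
        if state is not None:
            stream.load_state_dict(state)
        try:
            for example in stream:
                state = stream.state_dict()  # checkpoint after every example
                failures = 0
                yield example
            return  # stream exhausted normally
        except Exception:
            failures += 1
            if failures > max_retries:
                raise  # give up after too many consecutive failures
            time.sleep(backoff_s * failures)  # simple linear backoff


class FlakyStream:
    """Hypothetical stand-in for a streaming dataset; errors once at `fail_at`."""

    failed = False  # class-level flag so the simulated error happens only once

    def __init__(self, n=6, fail_at=3):
        self.n, self.fail_at, self.pos = n, fail_at, 0

    def state_dict(self):
        return {"pos": self.pos}

    def load_state_dict(self, state):
        self.pos = state["pos"]

    def __iter__(self):
        while self.pos < self.n:
            if self.pos == self.fail_at and not FlakyStream.failed:
                FlakyStream.failed = True
                raise ConnectionError("simulated HTTP 500")
            value = self.pos
            self.pos += 1
            yield value


results = list(resilient_iter(FlakyStream))
print(results)
```

With the real library, `make_stream` would presumably be something like `lambda: load_dataset(..., split="train", streaming=True)`, with the dataset arguments filled in for Fineweb. Checkpointing after every example keeps the sketch simple. If calling `state_dict()` that often turns out to be expensive, saving every N examples should also work, at the cost of re-reading up to N examples after a crash.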