Is there a built-in way of handling errors when streaming datasets? I’ve been trying to stream the RedPajama-1T dataset and have hit errors on most subsets on multiple occasions (see the screenshots for GitHub and C4 below):
If there isn’t a built-in way, that’s fine. I’ll look at writing a class that inherits from IterableDataset and handles the issue (unless there is a better approach, in which case I’m all ears).
RedPajama-1T uses a custom dataset loading script to download files hosted outside of HF, which can lead to unexpected failures. Maybe you can ask the authors why they’re not hosting the files on HF directly by opening a discussion: togethercomputer/RedPajama-Data-1T · Discussions
There are already retry mechanisms in datasets / huggingface_hub when streaming files from HF
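Those built-in retries apply to files served from the Hub, though; since the RedPajama script downloads from external hosts, wrapping the flaky call in a manual retry with backoff may still be needed. A minimal sketch (`with_retries` is my own naming, not a `datasets` or `huggingface_hub` helper):

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Hypothetical helper: sleeps base_delay, 2*base_delay, ... between
    attempts and re-raises the last exception when attempts run out.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

For example, `f = with_retries(lambda: open(path, "rb"))` retries a failing open; the same pattern can wrap a download call inside a loading script.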
For anyone looking for an immediate fix: I added error handling to their custom DatasetBuilder class and it looks like a viable workaround (still testing).
I used Git LFS to download their dataset/loader into the directory where my code was, then modified the RedPajama1T class, specifically the `_generate_examples` function. It should look like this:
```python
def _generate_examples(self, files):
    """This function returns the examples in the raw (text) form."""
    key = 0
    errors = []
    for subset in files:
        if subset == "common_crawl":
            import zstandard as zstd

            # try/except per file so one bad download doesn't abort
            # the rest of the subset
            for path in files[subset]:
                try:
                    # Common Crawl files are zstd-compressed JSONL
                    with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
                        for i, row in enumerate(f):
                            try:
                                data = json.loads(row)
                                text = data["text"]
                                del data["text"]
                                yield key, {
                                    "text": text,
                                    "meta": json.dumps(data),
                                    "red_pajama_subset": subset,
                                }
                                key += 1
                            except Exception as e:
                                # Malformed row: log it and keep streaming
                                print(f"Subset: {subset}")
                                print(f"Path: {path}")
                                print(f"Row: {row}")
                                print(e)
                except Exception as e:
                    # Failure opening/reading the file itself
                    errors.append(e)
        else:
            for path in files[subset]:
                try:
                    with open(path, encoding="utf-8") as f:
                        for i, row in enumerate(f):
                            try:
                                data = json.loads(row)
                                if "meta" not in data:
                                    text = data["text"]
                                    del data["text"]
                                    yield key, {
                                        "text": text,
                                        "meta": json.dumps(data),
                                        "red_pajama_subset": subset,
                                    }
                                else:
                                    yield key, {
                                        "text": data["text"],
                                        "meta": data["meta"],
                                        "red_pajama_subset": subset,
                                    }
                                key += 1
                            except Exception as e:
                                # Malformed row: log it and keep streaming
                                print(f"Subset: {subset}")
                                print(f"Path: {path}")
                                print(f"Row: {row}")
                                print(e)
                except Exception as e:
                    # Failure opening/reading the file itself
                    errors.append(e)
```
You can then use `load_dataset` to read the modified dataset. The dataset’s directory is named RedPajama-Data-1T and was in the same directory as my code (you’ll need to change the path passed to `load_dataset` otherwise).