Thanks Quentin, I’ll do that.
For anyone looking for an immediate fix, I added error handling to their custom DatasetBuilder class and it looks like a viable workaround (still testing).
I used Git LFS to download their dataset and loader script into the directory where my code was. Then I modified the RedPajama1T class, specifically the _generate_examples function. It should look like this:
def _generate_examples(self, files):
    """This function returns the examples in the raw (text) form."""
    key = 0
    errors = []  # file-level errors end up here; print or log them if you need them
    for subset in files:
        if subset == "common_crawl":
            import zstandard as zstd
            try:
                for path in files[subset]:
                    with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
                        for i, row in enumerate(f):
                            try:
                                data = json.loads(row)
                                text = data["text"]
                                del data["text"]
                                yield key, {
                                    "text": text,
                                    "meta": json.dumps(data),
                                    "red_pajama_subset": subset,
                                }
                                key += 1
                            except Exception as e:
                                # Bad row: report where it came from and move on
                                print(f'Subset: {subset}')
                                print(f'Path: {path}')
                                print(f'Row: {row}')
                                print(e)
            except Exception as e:
                errors.append(e)
        else:
            for path in files[subset]:
                try:
                    with open(path, encoding="utf-8") as f:
                        for i, row in enumerate(f):
                            try:
                                data = json.loads(row)
                                if "meta" not in data:
                                    text = data["text"]
                                    del data["text"]
                                    yield key, {
                                        "text": text,
                                        "meta": json.dumps(data),
                                        "red_pajama_subset": subset,
                                    }
                                else:
                                    yield key, {
                                        "text": data["text"],
                                        "meta": data["meta"],
                                        "red_pajama_subset": subset,
                                    }
                                key += 1
                            except Exception as e:
                                # Bad row: report where it came from and move on
                                print(f'Subset: {subset}')
                                print(f'Path: {path}')
                                print(f'Row: {row}')
                                print(e)
                except Exception as e:
                    errors.append(e)
You can then use load_dataset to read the modified dataset. The dataset’s directory is named RedPajama-Data-1T and was in the same directory as my code (otherwise you’ll need to change the path passed to load_dataset).
import datasets
rpj_arxiv_dataset = datasets.load_dataset('./RedPajama-Data-1T', 'arxiv', streaming=True)
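If you want to sanity-check the row-level error handling on its own before pointing it at the full dataset, here is a minimal sketch of the same skip-and-report pattern. It is not part of the RedPajama loader: the function name iter_jsonl_rows, the "demo" subset label, and the sample rows are all made up for illustration, and it only uses the standard library.

```python
import io
import json


def iter_jsonl_rows(lines, subset="demo"):
    """Yield (key, example) pairs from JSONL lines, skipping rows that fail to parse.

    Hypothetical helper for illustration; it mirrors the per-row try/except
    in the modified _generate_examples, not the loader itself.
    """
    key = 0
    for line_no, row in enumerate(lines):
        try:
            data = json.loads(row)
            text = data.pop("text")  # raises if the row is not a dict with "text"
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as e:
            # Report the bad row and keep going instead of aborting the stream
            print(f"Skipping row {line_no} in {subset}: {e!r}")
            continue
        yield key, {
            "text": text,
            "meta": json.dumps(data),
            "red_pajama_subset": subset,
        }
        key += 1


# Made-up sample: two valid rows around one malformed row
sample = io.StringIO(
    '{"text": "good row", "source": "a"}\n'
    "not json at all\n"
    '{"text": "another good row"}\n'
)
examples = list(iter_jsonl_rows(sample))
print(f"Parsed {len(examples)} examples")
```

The malformed row is reported and skipped, and the keys stay consecutive because key is only incremented after a successful yield, which is also how the modified loader behaves.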