I follow this example to handle errors in custom dataset. I am writing a dataset script which read jsonl files and i need to handle errors and continue reading files without raising exception and exit the execution.
def _generate_examples(self, filepaths):
errors=[]
id_ = 0
for filepath in filepaths:
try:
with open(filepath, 'r') as f:
for line in f:
json_obj = json.loads(line)
yield id_, json_obj
id_ += 1
except Exception as exc:
logger.error(f"error occur at filepath: {filepath}")
errors.append(error)
seems the logger.error is printed but still exception is raised the the run is exit.
Downloading and preparing dataset custom_dataset/default to /home/myuser/.cache/huggingface/datasets/custom_dataset/default-a14cdd566afee0a6/1.0.0/acfcc9fb9c57034b580c4252841
ERROR: datasets_modules.datasets.custom_dataset.acfcc9fb9c57034b580c4252841bb890a5617cbd28678dd4be5e52b81188ad02.custom_dataset: 2024-07-22 10:47:42,167: error occur at filepath: '/home/myuser/ds/corrupted-file.jsonl
Traceback (most recent call last):
File "/home/myuser/.cache/huggingface/modules/datasets_modules/datasets/custom_dataset/ac..2/custom_dataset.py", line 48, in _generate_examples
json_obj = json.loads(line)
File "myenv/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "myenv/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "myenv/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 4 (char 3)
Generating train split: 0 examples [00:06, ? examples/s]>
RemoteTraceback:
"""
Traceback (most recent call last):
File "myenv/lib/python3.8/site-packages/datasets/builder.py", line 1637, in _prepare_split_single
num_examples, num_bytes = writer.finalize()
File "myenv/lib/python3.8/site-packages/datasets/arrow_writer.py", line 594, in finalize
raise SchemaInferenceError("Please pass `features` or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass `features` or at least one example when writing data
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "myenv/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "myenv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1353, in
_write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "myenv/lib/python3.8/site-packages/datasets/builder.py", line 1646, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
"""
The above exception was the direct cause of the following exception:
โ โ
โ myenv/lib/python3.8/site-packages/datasets/utils/py_utils. โ
โ py:1377 in <listcomp> โ
โ โ
โ 1374 โ โ โ โ if all(async_result.ready() for async_result in async_results) and queue โ
โ 1375 โ โ โ โ โ break โ
โ 1376 โ โ # we get the result in case there's an error to raise โ
โ โฑ 1377 โ โ [async_result.get() for async_result in async_results] โ
โ 1378 โ
โ โ
โ โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ locals โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ โ
โ โ .0 = <list_iterator object at 0x7f2cc1f0ce20> โ โ
โ โ async_result = <multiprocess.pool.ApplyResult object at 0x7f2cc1f79c10> โ โ
โ โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ โ
โ โ
โ myenv/lib/python3.8/site-packages/multiprocess/pool.py:771 โ
โ in get โ
โ โ
โ 768 โ โ if self._success: โ
โ 769 โ โ โ return self._value โ
โ 770 โ โ else: โ
โ โฑ 771 โ โ โ raise self._value โ
โ 772 โ โ
โ 773 โ def _set(self, i, obj): โ
โ 774 โ โ self._success, self._value = obj โ
โ โ
โ โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ locals โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ โ
โ โ self = <multiprocess.pool.ApplyResult object at 0x7f2cc1f79c10> โ โ
โ โ timeout = None โ โ
โ โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ โ
DatasetGenerationError: An error occurred while generating the dataset