Hi! I'm trying to upload a dataset I generated myself.
I generated the dataset successfully, but every time I try to load it, the error below occurs and I don't know how to fix it. Here's the full error message.
Error code: FeaturesError
Exception: UnicodeDecodeError
Message: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Traceback: Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/split/first_rows.py", line 323, in compute
compute_first_rows_from_parquet_response(
File "/src/services/worker/src/worker/job_runners/split/first_rows.py", line 88, in compute_first_rows_from_parquet_response
rows_index = indexer.get_rows_index(
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 631, in get_rows_index
return RowsIndex(
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 512, in __init__
self.parquet_index = self._init_parquet_index(
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 529, in _init_parquet_index
response = get_previous_step_or_raise(
File "/src/libs/libcommon/src/libcommon/simple_cache.py", line 539, in get_previous_step_or_raise
raise CachedArtifactError(
libcommon.simple_cache.CachedArtifactError: The previous step failed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py", line 122, in _generate_tables
pa_table = paj.read_json(
File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/split/first_rows.py", line 241, in compute_first_rows_from_streaming_response
iterable_dataset = iterable_dataset._resolve_features()
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 2216, in _resolve_features
features = _infer_features_from_batch(self.with_format(None)._head())
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1239, in _head
return _examples_to_batch(list(self.take(n)))
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1389, in __iter__
for key, example in ex_iterable:
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1044, in __iter__
yield from islice(self.ex_iterable, self.n)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 282, in __iter__
for key, pa_table in self.generate_tables_fn(**self.kwargs):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py", line 145, in _generate_tables
dataset = json.load(f)
File "/usr/local/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 1101, in read_with_retries
out = read(*args, **kwargs)
File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
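
One possible reading of the last error: 0xff as the very first byte is how a UTF-16 (or UTF-32) little-endian byte order mark begins (FF FE), so the JSON file may have been saved in one of those encodings rather than UTF-8. That is only a guess from the traceback, not something I have confirmed. Below is a minimal sketch for checking a local copy of the file and rewriting it as UTF-8. The helper names `detect_bom_encoding` and `reencode_to_utf8` and any file names are hypothetical, and the sketch assumes the file starts with a BOM; a file in some other encoding without a BOM would need a different approach.

```python
import codecs
from pathlib import Path

# Known byte order marks and a Python codec that decodes each one.
# UTF-32 is listed before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom_encoding(raw: bytes) -> str:
    """Return a codec name based on the leading BOM, defaulting to 'utf-8'."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return "utf-8"  # assumption: no BOM means the file is already UTF-8


def reencode_to_utf8(src: str, dst: str) -> str:
    """Read src, decode it using its BOM, and write dst as UTF-8 without a BOM.

    Returns the codec that was used to decode src.
    """
    raw = Path(src).read_bytes()
    encoding = detect_bom_encoding(raw)
    text = raw.decode(encoding)  # these codecs strip the BOM while decoding
    # Write bytes directly so line endings are not translated.
    Path(dst).write_bytes(text.encode("utf-8"))
    return encoding
```

For example, `reencode_to_utf8("data.json", "data_utf8.json")` (both file names are placeholders) returns the detected codec; if it reports anything other than `utf-8`, uploading the rewritten file in place of the original might be worth trying.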