Hi,
In this wikipedia 20230601 dataset, I use a dataset script to describe the different configurations (it borrows almost everything from your Wikipedia dataset). The data files are unsharded parquet stored in data/20230601/.
For any config, I can use the dataset with load_dataset(), but the web preview fails and yields the following backtrace:
Error code: StreamingRowsError
Exception: ArrowInvalid
Message: Unrecognized filesystem type in URI: https://huggingface.co/datasets/graelo/wikipedia/resolve/main/data/20230601/train-ab.parquet
Traceback: Traceback (most recent call last):
File "/src/services/worker/src/worker/utils.py", line 363, in get_rows_or_raise
return get_rows(
File "/src/services/worker/src/worker/utils.py", line 305, in decorator
return func(*args, **kwargs)
File "/src/services/worker/src/worker/utils.py", line 341, in get_rows
rows_plus_one = list(itertools.islice(ds, rows_max_number + 1))
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 981, in __iter__
for key, example in ex_iterable:
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 116, in __iter__
yield from self.generate_examples_fn(**self.kwargs)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 801, in wrapper
for key, table in generate_tables_fn(**kwargs):
File "/tmp/modules-cache/datasets_modules/datasets/graelo--wikipedia/9a12bfb7b31e156d229f0af7e5e4b56a7ff1402dee503d6a732097e88a18f59e/wikipedia.py", line 473, in _generate_tables
pf = pq.ParquetFile(filepath)
File "/src/services/worker/.venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 316, in __init__
filesystem, source = _resolve_filesystem_and_path(
File "/src/services/worker/.venv/lib/python3.9/site-packages/pyarrow/fs.py", line 187, in _resolve_filesystem_and_path
filesystem, path = FileSystem.from_uri(path)
File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: https://huggingface.co/datasets/graelo/wikipedia/resolve/main/data/20230601/train-ab.parquet
Notes:
- In my script, I tried with a relative url and also an absolute one.
- I’m pretty sure this is due to datasets-server first resolving the relative URL (it adds https://hugg.../resolve/main/) and passing it as the filepath arg to _generate_tables() as is. Unsurprisingly, pyarrow does not handle a web URL, hence the exception.
- I tried dl_manager.download_and_extract(...) and dl_manager.download(...).
- I’m aware of the open PR around structuring the repo to handle various configs. That will be great, but I also like the dataset script approach because it allows me to specify a version.
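To make the second note concrete, here is a minimal sketch of one possible workaround, not a confirmed fix. It assumes that, in streaming mode, the datasets library patches the builtin open() inside a dataset script so that it can also read remote URLs. Under that assumption, handing pyarrow an already opened file object instead of the URL string avoids the filesystem lookup that raises ArrowInvalid. The function name iter_parquet_tables and the batch_size default are hypothetical; in the script the same body would live in _generate_tables().

```python
def iter_parquet_tables(filepaths, batch_size=10_000):
    """Yield (key, pyarrow.Table) pairs from a list of parquet files.

    Hypothetical helper: it mirrors what _generate_tables() could do.
    pyarrow is imported here only to keep the sketch self-contained;
    a dataset script would normally import it at the top of the module.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    for file_idx, filepath in enumerate(filepaths):
        # Pass a file object, not the path/URL string. Given a string,
        # pq.ParquetFile tries to resolve a pyarrow filesystem from it,
        # which is the step that fails in the traceback for https:// URLs.
        # Assumption: in streaming mode, open() is patched by datasets
        # to handle remote URLs; locally it is the plain builtin.
        with open(filepath, "rb") as f:
            parquet_file = pq.ParquetFile(f)
            for batch_idx, batch in enumerate(
                parquet_file.iter_batches(batch_size=batch_size)
            ):
                # Keys only need to be unique across files and batches.
                yield f"{file_idx}_{batch_idx}", pa.Table.from_batches([batch])
```

With local files this behaves the same as passing the path directly, so the non-streaming load_dataset() path should be unaffected.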
I’m guessing there’s a lot going on with the automated conversion to parquet.
Is my issue going to be solved as a side effect of your work on datasets and datasets-server, or is there something I should fix on my side?
Thanks!