Dataset Preview error with a dataset script and parquet files

Hi,

In this wikipedia 20230601 dataset, I use a dataset script to describe the different configurations (it borrows almost everything from your :hugs: Wikipedia dataset). The data files are unsharded parquet files stored in data/20230601/.

For any config, I can use the dataset with load_dataset(), but the web preview fails and yields the following backtrace:

Error code:   StreamingRowsError
Exception:    ArrowInvalid
Message:      Unrecognized filesystem type in URI: https://huggingface.co/datasets/graelo/wikipedia/resolve/main/data/20230601/train-ab.parquet
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/src/worker/utils.py", line 363, in get_rows_or_raise
                  return get_rows(
                File "/src/services/worker/src/worker/utils.py", line 305, in decorator
                  return func(*args, **kwargs)
                File "/src/services/worker/src/worker/utils.py", line 341, in get_rows
                  rows_plus_one = list(itertools.islice(ds, rows_max_number + 1))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 981, in __iter__
                  for key, example in ex_iterable:
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 116, in __iter__
                  yield from self.generate_examples_fn(**self.kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 801, in wrapper
                  for key, table in generate_tables_fn(**kwargs):
                File "/tmp/modules-cache/datasets_modules/datasets/graelo--wikipedia/9a12bfb7b31e156d229f0af7e5e4b56a7ff1402dee503d6a732097e88a18f59e/wikipedia.py", line 473, in _generate_tables
                  pf = pq.ParquetFile(filepath)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 316, in __init__
                  filesystem, source = _resolve_filesystem_and_path(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/pyarrow/fs.py", line 187, in _resolve_filesystem_and_path
                  filesystem, path = FileSystem.from_uri(path)
                File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
                File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
                File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
              pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: https://huggingface.co/datasets/graelo/wikipedia/resolve/main/data/20230601/train-ab.parquet

Notes:

  • In my script, I tried with a relative url and also an absolute one.
  • I’m pretty sure this is due to datasets-server first resolving the relative url (it adds https://hugg.../resolve/main/) and passing it as the filepath arg to _generate_tables() as is. Unsurprisingly, pyarrow does not handle a web url, hence the exception.
  • I tried dl_manager.download_and_extract(...) and dl_manager.download(...).
  • I’m aware of the open PR around structuring the repo to handle various configs. That will be great, but I also like the dataset script approach because it lets me specify a version.

I’m guessing there’s a lot going on with the automated conversion to parquet.

Is my issue going to be solved as a side effect of your work on datasets and datasets-server, or is there something I should fix on my side?

Thanks!

Hi! pq.ParquetFile() doesn’t support URLs, but we do extend open() to work with URLs (we use URLs in streaming mode for the preview).

If you use open() and pass the resulting file-like object to pq.ParquetFile, it should work!

I opened a PR here: graelo/wikipedia · fix preview

Oooooooh :wink: Very nice, Q, it works great.

Can you share how you got it to work?

Note that dataset previews are now disabled for datasets with a script for security reasons (some people were abusing it). You should use a dataset in a supported format (parquet, csv, jsonl, etc.) without a script if you want the dataset viewer to work.