Error "TypeError: not a path-like object" when iterating through a streamed dataset

Hello friends! I’m getting an error in a previously working notebook when streaming a dataset.

  • I started with the Datasets overview colab
  • I built a data analysis of the LAION Aesthetics dataset, which is huge, so I streamed it
  • It was working on Monday (Sep 5), but when I ran it again the next day I got the following error while iterating through it
  • Everything works fine when it’s not streamed
from datasets import load_dataset
dataset = load_dataset("ChristophSchuhmann/improved_aesthetics_5plus", split="train", streaming=True)
print(next(iter(dataset)))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pyarrow/io.pxi in pyarrow.lib.get_native_file()

11 frames
TypeError: not a path-like object

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pyarrow/io.pxi in pyarrow.lib.PythonFile.__cinit__()

TypeError: readable file expected

Here is a Colab replicating the error. Does anyone know what might be going wrong here?

I also ran into this with datasets, and I can’t find the reason why… :frowning:
It seems to be a problem with the Colab kernel environment, but I couldn’t find a solution (I tested it on my local computer, and it works fine there).
It occurs not only with Arrow files but also with CSV and any other format when streaming=True.

We use fsspec to stream data, but Google Colab ships fsspec==2022.8.1, which has been yanked on PyPI. The Colab team needs to fix this.

In the meantime, you can simply update fsspec:

pip install -U fsspec
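To confirm whether the environment is affected, a minimal sketch like the one below can report the installed fsspec version and compare it with the yanked release mentioned above. The helper names `is_yanked_fsspec` and `installed_fsspec_version` are hypothetical and only for illustration, and the sketch assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# The release reported above as yanked on PyPI.
YANKED_FSSPEC = "2022.8.1"


def is_yanked_fsspec(installed: str) -> bool:
    """Return True if the given version string is the yanked release (hypothetical helper)."""
    return installed.strip() == YANKED_FSSPEC


def installed_fsspec_version():
    """Return the installed fsspec version string, or None if fsspec is not installed."""
    try:
        return version("fsspec")
    except PackageNotFoundError:
        return None


current = installed_fsspec_version()
if current is None:
    print("fsspec is not installed")
elif is_yanked_fsspec(current):
    print(f"fsspec=={current} is the yanked release; try upgrading it")
else:
    print(f"fsspec=={current} is not the yanked release")
```

If fsspec was already imported in the notebook session before upgrading, the runtime may need a restart before the new version is picked up.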

Issue reported here: Colab has buggy `fsspec==2022.8.1` which has been YANKED on PyPI · Issue #3055 · googlecolab/colabtools · GitHub


That worked! Thank you so much :raised_hands: