How to load parquet to datasets without caching?

Hi,
I have one-off jobs where I load a parquet file, do something with it, and move on.
I don’t want to cache anything: the data won’t be reused, and I’d end up with lots of garbage files to clean up.
I set datasets.set_caching_enabled(False), but it has no effect.
There doesn’t seem to be an option on Dataset.from_parquet to control this either.
How would one accomplish this?

Hi! datasets.set_caching_enabled(False) only affects the Arrow files created via .map, not load_dataset. Also, parquet files cannot be zero-copied/memory-mapped efficiently (see Reading and Writing the Apache Parquet Format — Apache Arrow v8.0.0), so converting to Arrow is the only option for big datasets. Still, if you have enough RAM and want to skip this step to avoid generating a cache file, you can create an in-memory dataset directly from parquet as follows:

from datasets import Dataset
import pyarrow.parquet as pq

# Read the parquet file into an Arrow table and wrap it in an in-memory Dataset
dset = Dataset(pq.read_table("path/to/parquet/file", memory_map=True))
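
If you then transform the dataset, passing keep_in_memory=True to .map keeps the result in RAM as well, so no cache file is written. A minimal sketch continuing from the snippet above (the "text" column and the length computation are hypothetical placeholders for your own data and processing):

# Hypothetical follow-up: assumes the parquet file has a "text" column
dset = dset.map(lambda ex: {"n_chars": len(ex["text"])}, keep_in_memory=True)
print(dset[0])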