Hi! datasets.set_caching_enabled(False) only affects the Arrow files created via .map, not those created by load_dataset. Also, Parquet files cannot be zero-copied/memory-mapped efficiently (see "Reading and Writing the Apache Parquet Format" in the Apache Arrow v8.0.0 documentation), so the Arrow conversion is the only option for big datasets. Still, if you have enough RAM and want to skip this step to avoid generating a cache file, you can create an in-memory dataset directly from Parquet as follows:
from datasets import Dataset
import pyarrow.parquet as pq

# Read the Parquet file into an Arrow table and wrap it in a Dataset (no cache file is written)
dset = Dataset(pq.read_table("path/to/parquet/file", memory_map=True))