Hi! load_dataset
saves a generated dataset in the Arrow format to be able to memory-map it later, so this is the expected behavior. If you are not interested in random access, you can pass streaming=True
to get an IterableDataset
that doesn’t require cache files.
1 Like