take works in your updated example but, as described above, doesn't solve the performance issue.
What is the "generated data" that is cached? If I interrupt loading, I often see tracebacks that seem to indicate a lot of filesystem operations:
>>> dset = datasets.load_dataset('imagefolder', split='train', data_files=data_files, task="image-classification")
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 1656, in load_dataset
    builder_instance = load_dataset_builder(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 1439, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 1097, in dataset_module_factory
    return PackagedDatasetModuleFactory(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 743, in get_module
    data_files = DataFilesDict.from_local_or_remote(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 590, in from_local_or_remote
    DataFilesList.from_local_or_remote(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 558, in from_local_or_remote
    data_files = resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 195, in resolve_patterns_locally_or_by_urls
    for path in _resolve_single_pattern_locally(base_path, pattern, allowed_extensions):
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 121, in _resolve_single_pattern_locally
    glob_iter = [PurePath(filepath) for filepath in fs.glob(pattern) if fs.isfile(filepath)]
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 63, in glob
    return super().glob(path, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 516, in glob
    allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 425, in find
    for _, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 404, in walk
    yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 372, in walk
    listing = self.ls(path, detail=True, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 57, in ls
    return [self.info(f) for f in it]
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 57, in <listcomp>
    return [self.info(f) for f in it]
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 68, in info
    out = path.stat(follow_symlinks=False)
I'd hope that a lot of this information could be cached, so that the user can decide whether to verify it during load, e.g., if they are worried that the files or directories have been modified. But as I said previously, ignore_verifications=True doesn't seem to have an effect. Is the flag ignored entirely here, or is only something much narrower being verified?
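To make the request concrete, here is a rough sketch of the kind of caching I have in mind, using only the standard library. The names `list_files`, `cached_file_list`, the `verify` flag and the JSON manifest are all hypothetical; none of this is part of the datasets API, and I don't know whether the library could hook something like it into its file resolution.

```python
import json
import os
from pathlib import Path


def list_files(root):
    """Walk root once and return the sorted relative paths of all files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


def cached_file_list(root, manifest_path, verify=False):
    """Return the file list for root, reusing a JSON manifest when present.

    Hypothetical helper: with verify=False an existing manifest is trusted
    and the directory tree is not touched; with verify=True the tree is
    walked again and the manifest is rewritten.
    """
    manifest = Path(manifest_path)
    if manifest.exists() and not verify:
        return json.loads(manifest.read_text())
    files = list_files(root)
    manifest.write_text(json.dumps(files))
    return files
```

With something like this, the first call pays for the full directory walk, and later calls with verify=False only read one small file. The manifest should live outside the image directory so it is not picked up as a data file.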