Efficiently slicing imagefolder dataset split

Iā€™m using the imagefolder dataset to load a manually downloaded ImageNet directory, e.g.:

data_files = { 'train': ['../ImageNet/train/**'] }
datasets.load_dataset('imagefolder', split='train', data_files=data_files, task="image-classification")

Most of the time I want to use only a very small slice of the dataset, but slicing doesn’t seem to improve performance, e.g., using split='train[:100]'. It still takes several minutes and appears to process the entire dataset before producing the subset. Using ignore_verifications=True or streaming=True doesn’t seem to help.

Thanks in advance!

Hi! You can access the first 100 examples in the streaming mode as follows:

import datasets
data_files = { 'train': ['../ImageNet/train/**'] }
dset = datasets.load_dataset('imagefolder', split='train', data_files=data_files, task="image-classification")
dset = dset.take(100)
for ex in dset:
    ...

The train[:100] syntax is currently not supported in the streaming mode, but we plan to add it at some point (see Enable splits during streaming the dataset Ā· Issue #2962 Ā· huggingface/datasets Ā· GitHub). And in the non-streaming mode, we always download all the data instead of downloading only the data needed to build the requested split. This is a well-known limitation of datasets, and we plan to address it soon.

Thanks for your reply.

The load_dataset function is the bottleneck, so subsetting afterward (e.g., using the select method) doesn’t help with the loading performance. Trying your example, it looks like there’s no take method on dset: AttributeError: 'Dataset' object has no attribute 'take'.

Is load_dataset('imagefolder', ...) not efficiently caching metadata and instead recollecting file info from the filesystem each time itā€™s called?

The torchvision ImageNet class (which extends its ImageFolder implementation, which I believe datasets’ imagefolder is based on) seems to be a lot more efficient. I’m trying to avoid that dependency though.

My bad. This should work:

import datasets
data_files = { 'train': ['../ImageNet/train/**'] }
dset = datasets.load_dataset('imagefolder', split='train', data_files=data_files, streaming=True, task="image-classification")
dset = dset.take(100)

Is load_dataset('imagefolder', ...) not efficiently caching metadata and instead recollecting file info from the filesystem each time itā€™s called?

We cache the generated data in the non-streaming mode to avoid executing loading procedures twice, but we donā€™t store more info than that.

take works in your updated example, but as described above, doesnā€™t solve the performance issue.

What is the ā€œgenerated dataā€ that is cached? If I interrupt loading, I often see tracebacks that seem to indicate a lot of filesystem operations:

>>> dset = datasets.load_dataset('imagefolder', split='train', data_files=data_files, task="image-classification")
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 1656, in load_dataset
    builder_instance = load_dataset_builder(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 1439, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 1097, in dataset_module_factory
    return PackagedDatasetModuleFactory(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/load.py", line 743, in get_module
    data_files = DataFilesDict.from_local_or_remote(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 590, in from_local_or_remote
    DataFilesList.from_local_or_remote(
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 558, in from_local_or_remote
    data_files = resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 195, in resolve_patterns_locally_or_by_urls
    for path in _resolve_single_pattern_locally(base_path, pattern, allowed_extensions):
  File "/path/to/virtualenv/lib/python3.8/site-packages/datasets/data_files.py", line 121, in _resolve_single_pattern_locally
    glob_iter = [PurePath(filepath) for filepath in fs.glob(pattern) if fs.isfile(filepath)]
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 63, in glob
    return super().glob(path, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 516, in glob
    allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 425, in find
    for _, dirs, files in self.walk(path, maxdepth, detail=True, **kwargs):
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 404, in walk
    yield from self.walk(d, maxdepth=maxdepth, detail=detail, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/spec.py", line 372, in walk
    listing = self.ls(path, detail=True, **kwargs)
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 57, in ls
    return [self.info(f) for f in it]
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 57, in <listcomp>
    return [self.info(f) for f in it]
  File "/path/to/virtualenv/lib/python3.8/site-packages/fsspec/implementations/local.py", line 68, in info
    out = path.stat(follow_symlinks=False)

Iā€™d hope that a lot of this information could be cached, and then the user can decide if they want to verify it during load, e.g., if they are worried that the files/directories have been modified. But like I said previously, ignore_verifications=True doesnā€™t seem to have an effect. Is the flag ignored entirely here, or is only something much narrower being verified?

imagefolder iterates over data files twice by default, once to infer the labels and the second time to yield the actual examples. You can avoid the first pass by specifying the features manually in load_dataset:

load_dataset(..., features=datasets.Features({"image": datasets.Image(), "label": datasets.ClassLabel(names=[<list of labels>])}))
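
For example, a minimal sketch of this, assuming the class labels can be read from the subdirectory names of the train folder (paths as in the earlier snippets):

import os
import datasets

data_files = { 'train': ['../ImageNet/train/**'] }

# Build the label list once from the class subdirectory names instead of
# letting imagefolder scan every file to infer them.
label_names = sorted(os.listdir('../ImageNet/train'))

features = datasets.Features({
    'image': datasets.Image(),
    'label': datasets.ClassLabel(names=label_names),
})

dset = datasets.load_dataset(
    'imagefolder',
    split='train',
    data_files=data_files,
    features=features,
    task='image-classification',
)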

Also, before executing the imagefolder script, we need to resolve the globs specified as data_files (this is what your traceback shows), which takes time on some systems. You can turn the entire folder into a ZIP archive and pass the path to it in data_files to make this step faster.
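
For example, a sketch of the ZIP approach, assuming the same local layout as above (building the archive takes a while for a dataset this size, but it only needs to be done once):

import shutil
import datasets

# One-time step: pack the class subdirectories of the train folder into
# ../ImageNet_train.zip.
shutil.make_archive('../ImageNet_train', 'zip', '../ImageNet/train')

dset = datasets.load_dataset(
    'imagefolder',
    split='train',
    data_files={'train': ['../ImageNet_train.zip']},
    task='image-classification',
)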

ignore_verifications=True doesnā€™t have any effect as we skip checksum computation by default in imagefolder already to make the loading faster (if you are still interested in modifications check, you can set download_config=DownloadConfig(record_checksums=True) in load_dataset)

Thank you for the detailed explanation.

I tested specifying features (by just using the list of class labels from the previously loaded dataset) to avoid the first pass, but it didnā€™t seem to have much effect on performance.

If I understand correctly, resolving the globs is separate from the two iterations you mentioned. Can those results also be cached (e.g., for a given glob string after accounting for relative paths)?

Iā€™m still looking for a more efficient way to use ImageNet subsets with datasets. I see in PR 4299 that you changed the imagenet-1k dataset to use a huggingface-hosted download instead of user-downloaded files. It seems like the latter (previous) approach still has value though, e.g., for those of us who have already downloaded this very large dataset from the upstream source, esp. if the previous manual approach was more efficient than using imagefolder.

Iā€™m not able to test this myself since I donā€™t have the imagenet_object_localization_patched2019.tar.gz file expected by older datasets. I have files like ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar, and ILSVRC2012_devkit_t12.tar.gz which I got direct from the upstream source. (I tried and got errors with datasets==2.2.2.) Do you know how the older approach performance compares with imagefolder?

Using datasets version 2.6.1:
load_dataset returns a DatasetDict, which has no take method.
dset['train'] is a Dataset, which also has no take method.

I guess this is for an older version of datasets. How do I slice an already loaded Dataset?

Hi, I believe the take method is only available for IterableDataset and IterableDatasetDict as it is meant to be used when streaming datasets.

For a Dataset, you can slice it with regular Python indexing (see the docs here for more info).
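
For example, a minimal sketch, assuming dset is the DatasetDict returned by load_dataset:

train = dset['train']

# Regular Python slicing returns a dict of columns for the first 100 rows.
first_100_columns = train[:100]

# To keep working with a Dataset object instead, use select:
first_100 = train.select(range(100))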