Using External Datasets with HuggingFace Data Loader

Hi, I am a beginner with HuggingFace and PyTorch, and I am having trouble with what should be a simple task. I took the ViT tutorial "Fine-Tune ViT for Image Classification with 🤗 Transformers" and replaced the second block with this:

from datasets import load_dataset
ds = load_dataset(
    './tiny-imagenet-200')
#   data_files={"train": "train", "test": "test", "validate": "val"})
ds

As the name implies, ./tiny-imagenet-200 is the directory containing the tiny-imagenet dataset in its default configuration. Here is the data structure of that dataset:
./tiny-imagenet-200/
    test/
        images/
            [all test JPEG images]
    train/
        [directories for each class number]/
            images/
                [current class JPEG images]
            [class name]_boxes.txt
    val/
        images/
            [all validation JPEG images]
        val_annotations.txt
    wnids.txt (appears to be a list of class numbers)
    words.txt (table of class numbers to names)

The error returned is:

Unable to resolve any data file that matches ['*test', '*eval'] at /home/omniverse03/Documents/BeansTransformerTutorial/tiny-imagenet-200 with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']

Hi!

I assume that you are using the official dataset from Stanford, which can be downloaded by doing:

wget http://cs231n.stanford.edu/tiny-imagenet-200.zip

After unzipping it I simply did:

from datasets import load_dataset
ds = load_dataset('imagefolder', data_dir='./tiny-imagenet-200')

This loads all 120k examples into a single split. To load a specific split (e.g. the test set), you could do something like:

ds_test = load_dataset('imagefolder', data_dir='./tiny-imagenet-200/test')
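
Alternatively, as an untested variant, you can pass glob patterns via data_files to get named splits in one call:

ds = load_dataset(
    "imagefolder",
    data_files={
        "train": "./tiny-imagenet-200/train/**/*.JPEG",
        "validation": "./tiny-imagenet-200/val/**/*.JPEG",
        "test": "./tiny-imagenet-200/test/**/*.JPEG",
    },
)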

As you can see, this uses the ImageFolder dataset builder; the alternative would be to write your own custom loading script (e.g. food101.py). Here is some information on how to write data loading scripts. I believe all the examples are for text data, but you can just check the loading script of any image dataset available on the Hub.
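
For a rough idea, a loading script is a subclass of GeneratorBasedBuilder. This untested skeleton (the paths and the plain string label feature are placeholder choices) covers just the train split:

# tiny_imagenet.py -- hypothetical skeleton of a custom loading script
import os
import datasets

class TinyImagenet(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "image": datasets.Image(),
                    # a ClassLabel built from wnids.txt would be nicer
                    "label": datasets.Value("string"),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        data_dir = "./tiny-imagenet-200"  # hypothetical local path
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"split_dir": os.path.join(data_dir, "train")},
            ),
        ]

    def _generate_examples(self, split_dir):
        # layout: train/<wnid>/images/<file>.JPEG
        for wnid in sorted(os.listdir(split_dir)):
            images_dir = os.path.join(split_dir, wnid, "images")
            for fname in sorted(os.listdir(images_dir)):
                path = os.path.join(images_dir, fname)
                yield path, {"image": path, "label": wnid}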

PS: Make sure that you have installed the "Image" feature from the datasets library!

pip install datasets[vision]

You can find more info in the official documentation.


Thank you so much. I was really struggling with that.
Do you know why the tutorial I was using was able to use an image dataset without installing datasets[vision] (it only installed datasets and transformers)? Is it because they are built-in HuggingFace datasets?

Hi! Note that

ds = load_dataset('imagefolder', data_dir='./tiny-imagenet-200')

puts all the images in a single train split. Also, it fails to infer the labels, since Tiny ImageNet is not laid out as a valid image folder.
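
For reference, the imagefolder builder infers each image's label from the name of its immediate parent directory, so it expects a layout roughly like this (hypothetical example):

    some-dataset/
        class_a/
            img_0.jpg
            img_1.jpg
        class_b/
            img_0.jpg

Tiny ImageNet puts every train image inside an extra images/ directory (and keeps the validation labels in a separate annotations file), so the inferred label would be "images" for every example.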

This is the corrected code:

import os
from datasets import DownloadConfig, load_dataset, Image, DatasetDict

PATH_TO_TINY_IMAGENET_DIR = "..."

# avoids issues with filelock + multiprocessing on some platforms
dc = DownloadConfig(num_proc=1)

# prepare train split
train_dset = load_dataset(
    "imagefolder",
    download_config=dc,
    data_dir=os.path.join(PATH_TO_TINY_IMAGENET_DIR, "train"),
    split="train",
    ignore_verifications=True,
)

# the inferred labels are wrong (every image's parent dir is "images"),
# so drop them and re-derive the wnid from the file path instead
train_dset = train_dset.remove_columns("label")
train_dset = train_dset.cast_column("image", Image(decode=False))  # expose the file path
# path is .../train/<wnid>/images/<file>, so the wnid is the third-from-last component
train_dset = train_dset.map(lambda ex: {"label": ex["image"]["path"].replace("\\", "/").split("/")[-3]})
train_dset = train_dset.cast_column("image", Image(decode=True))

train_dset = train_dset.class_encode_column("label")

# prepare validation split
val_dset = load_dataset(
    "imagefolder",
    download_config=dc,
    data_dir=os.path.join(PATH_TO_TINY_IMAGENET_DIR, "val", "images"),
    split="train",
    ignore_verifications=True,
)

# val labels live in val_annotations.txt: "<img file>\t<wnid>\t..."
annotations = {}
with open(os.path.join(PATH_TO_TINY_IMAGENET_DIR, "val", "val_annotations.txt")) as f:
    for line in f:
        if line.strip():
            img_file, label, *_ = line.split()
            annotations[img_file.lower()] = label

val_dset = val_dset.remove_columns("label")
val_dset = val_dset.cast_column("image", Image(decode=False))
# look up each image's wnid and encode it with the train split's ClassLabel
val_dset = val_dset.map(
    lambda ex: {"label": train_dset.features["label"].str2int(annotations[os.path.basename(ex["image"]["path"]).lower()])},
    features=train_dset.features,
)
val_dset = val_dset.cast_column("image", Image(decode=True))

# prepare test split (unlabeled)
test_dset = load_dataset(
    "imagefolder",
    download_config=dc,
    data_dir=os.path.join(PATH_TO_TINY_IMAGENET_DIR, "test", "images"),
    split="train",
    ignore_verifications=True,
)
test_dset = test_dset.remove_columns("label")
test_dset = test_dset.map(lambda ex: {"label": None}, features=train_dset.features)

# final dataset with all the splits
ds = DatasetDict({"train": train_dset, "validation": val_dset, "test": test_dset})
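
A quick sanity check on the result (illustrative):

print(ds)                             # three splits: train / validation / test
print(ds["train"].features["label"])  # ClassLabel over the 200 wnids
print(ds["validation"][0]["label"])   # integer id shared with the train split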

Thank you Mario. It's surprising that such a prominent dataset is so complicated to load. Note that even this solution leaves the class names as n000XXX rather than the corresponding text in words.txt, but that is not critical to my current task.

Continuing with my quest to run the tutorial with the tiny-imagenet dataset, I got to this block:

train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

and it threw "ValueError: operands could not be broadcast together with shapes (224,224) (3,)" on the first line. This appears to indicate a NumPy dimension error, based on my investigation, but it happens deep inside trainer.train(). Moreover, there aren't any relevant 3s or 224s anywhere in the earlier code.

All code is identical to the tutorial I mentioned, with these exceptions:

- the pip install that mapama mentioned
- block 2 replaced by Mario's code
- some printouts removed
- save, eval, and logging steps multiplied by 10 to reduce output
Full error message below:

ValueError                                Traceback (most recent call last)
/[mypath]/nextClassification.ipynb Cell 29' in <cell line: 1>()
----> 1 train_results = trainer.train()
      2 trainer.save_model()
      3 trainer.log_metrics("train", train_results.metrics)

File ~/.local/lib/python3.8/site-packages/transformers/trainer.py:1396, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1393 self.control = self.callback_handler.on_epoch_begin(args, self.state, self.control)
   1395 step = -1
-> 1396 for step, inputs in enumerate(epoch_iterator):
   1397
   1398     # Skip past any already trained steps if resuming training
   1399     if steps_trained_in_current_epoch > 0:
   1400         steps_trained_in_current_epoch -= 1

File ~/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py:530, in _BaseDataLoaderIter.__next__(self)
    528 if self._sampler_iter is None:
    529     self._reset()
--> 530 data = self._next_data()
    531 self._num_yielded += 1
    532 if self._dataset_kind == _DatasetKind.Iterable and \
    533         self._IterableDataset_len_called is not None and \
    534         self._num_yielded > self._IterableDataset_len_called:

File ~/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py:570, in _SingleProcessDataLoaderIter._next_data(self)
    568 def _next_data(self):
    569     index = self._next_index()  # may raise StopIteration
--> 570     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    571     if self._pin_memory:
    572         data = _utils.pin_memory.pin_memory(data)

File ~/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py:49, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     47 def fetch(self, possibly_batched_index):
     48     if self.auto_collation:
---> 49         data = [self.dataset[idx] for idx in possibly_batched_index]
     50     else:
     51         data = self.dataset[possibly_batched_index]

File ~/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py:49, in <listcomp>(.0)
     47 def fetch(self, possibly_batched_index):
     48     if self.auto_collation:
---> 49         data = [self.dataset[idx] for idx in possibly_batched_index]
     50     else:
     51         data = self.dataset[possibly_batched_index]

File ~/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py:1764, in Dataset.__getitem__(self, key)
   1762 def __getitem__(self, key):  # noqa: F811
   1763     """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 1764     return self._getitem(
   1765         key,
   1766     )

File ~/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py:1749, in Dataset._getitem(self, key, decoded, **kwargs)
   1747 formatter = get_formatter(format_type, features=self.features, decoded=decoded, **format_kwargs)
   1748 pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
-> 1749 formatted_output = format_table(
   1750     pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
   1751 )
   1752 return formatted_output

File ~/.local/lib/python3.8/site-packages/datasets/formatting/formatting.py:532, in format_table(table, key, formatter, format_columns, output_all_columns)
    530 python_formatter = PythonFormatter(features=None)
    531 if format_columns is None:
--> 532     return formatter(pa_table, query_type=query_type)
    533 elif query_type == "column":
    534     if key in format_columns:

File ~/.local/lib/python3.8/site-packages/datasets/formatting/formatting.py:281, in Formatter.__call__(self, pa_table, query_type)
    279 def __call__(self, pa_table: pa.Table, query_type: str) -> Union[RowFormat, ColumnFormat, BatchFormat]:
    280     if query_type == "row":
--> 281         return self.format_row(pa_table)
    282     elif query_type == "column":
    283         return self.format_column(pa_table)

File ~/.local/lib/python3.8/site-packages/datasets/formatting/formatting.py:387, in CustomFormatter.format_row(self, pa_table)
    386 def format_row(self, pa_table: pa.Table) -> dict:
--> 387     formatted_batch = self.format_batch(pa_table)
    388     try:
    389         return _unnest(formatted_batch)

File ~/.local/lib/python3.8/site-packages/datasets/formatting/formatting.py:418, in CustomFormatter.format_batch(self, pa_table)
    416 if self.decoded:
    417     batch = self.python_features_decoder.decode_batch(batch)
--> 418 return self.transform(batch)

[mypath]/nextClassification.ipynb Cell 17' in transform(example_batch)
      1 def transform(example_batch):
      2     # Take a list of PIL images and turn them to pixel values
----> 3     inputs = feature_extractor([x for x in example_batch['image']], return_tensors='pt')
      5     # Don't forget to include the labels!
      6     inputs['labels'] = example_batch['labels']

File ~/.local/lib/python3.8/site-packages/transformers/models/vit/feature_extraction_vit.py:143, in ViTFeatureExtractor.__call__(self, images, return_tensors, **kwargs)
    141     images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
    142 if self.do_normalize:
--> 143     images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
    145 # return as BatchFeature
    146 data = {"pixel_values": images}

File ~/.local/lib/python3.8/site-packages/transformers/models/vit/feature_extraction_vit.py:143, in <listcomp>(.0)
    141     images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
    142 if self.do_normalize:
--> 143     images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
    145 # return as BatchFeature
    146 data = {"pixel_values": images}

File ~/.local/lib/python3.8/site-packages/transformers/image_utils.py:186, in ImageFeatureExtractionMixin.normalize(self, image, mean, std)
    184     return (image - mean[:, None, None]) / std[:, None, None]
    185 else:
--> 186     return (image - mean) / std

ValueError: operands could not be broadcast together with shapes (224,224) (3,)

It's surprising that such a prominent dataset is so complicated to load. Note that even this solution leaves the class names as n000XXX rather than the corresponding text in words.txt, but that is not critical to my current task.

This dataset is tricky to load because it doesn't follow the standard image folder structure. You can use map, similar to the map calls in my snippet, to replace the class names with the corresponding words; a sketch of that is below.
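
Something along these lines should work. This is an untested sketch; it assumes the ds and PATH_TO_TINY_IMAGENET_DIR from my snippet above, and that words.txt is tab-separated:

import os

# words.txt maps each wnid to a comma-separated description;
# attach it as a new human-readable column with map
wnid_to_words = {}
with open(os.path.join(PATH_TO_TINY_IMAGENET_DIR, "words.txt")) as f:
    for line in f:
        if line.strip():
            wnid, words = line.strip().split("\t", 1)
            wnid_to_words[wnid] = words

label_feature = ds["train"].features["label"]  # ClassLabel over the wnids

ds = ds.map(
    lambda ex: {
        "label_text": wnid_to_words[label_feature.int2str(ex["label"])]
        if ex["label"] is not None  # the test split has no labels
        else None
    }
)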

and it threw "ValueError: operands could not be broadcast together with shapes (224,224) (3,)" on the first line.

It would be easier to debug this error from the actual code, but your notebook is not public, so my best guess is that some of the images are grayscale. Replacing the line:

inputs = feature_extractor([x for x in example_batch['image']], return_tensors='pt')

with

inputs = feature_extractor([x.convert("RGB") for x in example_batch['image']], return_tensors='pt')

should fix the issue.
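
If you want to confirm that first, a quick, illustrative check is to tally the PIL modes of a sample of training images; any mode other than "RGB" (e.g. "L" for grayscale) triggers exactly this broadcasting error:

from collections import Counter

# tally the PIL image modes over a sample of the train split
modes = Counter(ex["image"].mode for ex in ds["train"].select(range(1000)))
print(modes)  # something like Counter({'RGB': 982, 'L': 18}) would confirm it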


You are an absolute legend. May I ask what made you think that was the issue, so that perhaps I can solve something like this myself in the future?

ViT expects 3 input channels by default.
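
You can reproduce the failing normalization step in isolation:

import numpy as np

mean = np.array([0.5, 0.5, 0.5])  # one value per channel
rgb = np.zeros((224, 224, 3))
gray = np.zeros((224, 224))       # grayscale: no channel axis

print((rgb - mean).shape)  # (224, 224, 3) -- broadcasts over the channel axis
gray - mean                # ValueError: operands could not be broadcast
                           # together with shapes (224,224) (3,)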
