Huggingface Vision Dataset - the right way to use it?

Hey guys! I wanted to create a vision dataset of my own with Hugging Face Datasets by extending GeneratorBasedBuilder, like so:

from typing import AnyStr

import cv2
import datasets
import h5py
from datasets import DatasetInfo, DownloadManager


class AutoEncoderDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def __init__(self, training_pickle: AnyStr, validation_pickle: AnyStr, *args, writer_batch_size=None, **kwargs):
        super().__init__(*args, writer_batch_size=writer_batch_size, **kwargs)
        # Paths to the H5 files that hold the training and validation images
        self.training_pickle = training_pickle
        self.validation_pickle = validation_pickle

    def _info(self) -> DatasetInfo:
        # Each example consists of a single image
        features = datasets.Features({
            "image": datasets.Image()
        })

        return datasets.DatasetInfo(
            features=features
        )

    def _split_generators(self, dl_manager: DownloadManager):
        splits = [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "h5_path": self.training_pickle
                }
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "h5_path": self.validation_pickle
                }
            )
        ]

        return splits

    def _generate_examples(self, h5_path: AnyStr):
        with h5py.File(h5_path, "r") as infile:
            images = infile["images"]

            for _id in range(images.shape[0]):
                # Decode the stored bytes into an image array
                yield _id, {
                    "image": cv2.imdecode(images[_id][-1], cv2.IMREAD_COLOR)
                }

But when I pass this dataset to the Trainer, I get the following error:

***** Running training *****
  Num examples = 0
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 320
  0%|          | 0/320 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\nisha\Documents\Imagine\main.py", line 60, in <module>
    main()
  File "C:\Users\nisha\Documents\Imagine\main.py", line 52, in main
    atrain(args.dataset, args.subdatasets)
  File "C:\Users\nisha\Documents\Imagine\autoencoder_trainer.py", line 83, in train
    train_single_asset(subdataset, dataset_tag)
  File "C:\Users\nisha\Documents\Imagine\autoencoder_trainer.py", line 67, in train_single_asset
    trainer.train()
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\trainer.py", line 1339, in train
    for step, inputs in enumerate(epoch_iterator):
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\arrow_dataset.py", line 1764, in __getitem__
    return self._getitem(
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\arrow_dataset.py", line 1748, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\formatting\formatting.py", line 486, in query_table
    _check_valid_index_key(key, size)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\formatting\formatting.py", line 429, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 664 is out of bounds for size 0

Process finished with exit code 1

I can’t figure out what is wrong. I looped through the dataset object entirely and there were no errors. Any help would be really appreciated! I also noticed that the number of examples is 0 for some reason.

Hi!

I also noticed that the number of examples is 0 for some reason.

Yes, that’s the problem.

Are you sure that the H5 file is not empty? What does images.shape[0] return?
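If you want to rule the file out, a quick standalone check along these lines should do it (a sketch; the path and the "images" key are assumptions based on your script):

import h5py

# Open the H5 file directly, outside the builder, and report its size
with h5py.File("path/to/images.h5", "r") as infile:  # replace with your actual path
    images = infile["images"]
    print(images.shape)  # the first dimension should be the number of records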

I have double-checked my H5 file and it contains 1000 records. I also added a print(images.shape[0]) to the script and it prints 1000 to the console. I am also able to loop through the dataset using this bit of code:

from core.dataset.AutoEncoderDataset import AutoEncoderDataset, normalize_and_resize
from datasets import DownloadMode

# Paths to the train/val H5 files
h5py_file = "./datasets/all/assets_2/{train_val}/images.h5"
dataset_builder = AutoEncoderDataset(h5py_file.format(train_val="train"), h5py_file.format(train_val="val"))
# Force a rebuild so a stale cached version doesn't mask the problem
dataset_builder.download_and_prepare(download_mode=DownloadMode.FORCE_REDOWNLOAD)
huggingface_dataset = dataset_builder.as_dataset()

for i in huggingface_dataset["train"]:
    print(i)

One peculiar thing I noticed is that even though I yield both the _id and the dictionary with the image in it, only the dictionary gets printed. I don’t know if this is expected behaviour or not.

{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=256x256 at 0x2A3960F5790>}

Yes, that’s expected. The _id field is required for legacy reasons, but other than that it’s not important.

I think there is a problem with the __init__ signature in your script. Can you update the script as follows:

from typing import AnyStr

import cv2
import datasets
import h5py
from datasets import DatasetInfo, DownloadManager


class AutoEncoderDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def _info(self) -> DatasetInfo:
        features = datasets.Features({
            "image": datasets.Image()
        })

        return datasets.DatasetInfo(
            features=features
        )

    def _split_generators(self, dl_manager: DownloadManager):
        # Hardcode the split paths here instead of passing them through __init__
        splits = [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "h5_path": "./datasets/all/assets_2/train/images.h5",
                }
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "h5_path": "./datasets/all/assets_2/val/images.h5",
                }
            )
        ]

        return splits

    def _generate_examples(self, h5_path: AnyStr):
        with h5py.File(h5_path, "r") as infile:
            images = infile["images"]

            for _id in range(images.shape[0]):
                yield _id, {
                    "image": cv2.imdecode(images[_id][-1], cv2.IMREAD_COLOR)
                }

and let us know if that helps?
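With __init__ gone, the builder takes no custom arguments, so loading it would look roughly like this (a sketch based on your earlier snippet):

from datasets import DownloadMode

# No constructor arguments needed now that the paths are hardcoded in _split_generators
dataset_builder = AutoEncoderDataset()
dataset_builder.download_and_prepare(download_mode=DownloadMode.FORCE_REDOWNLOAD)
huggingface_dataset = dataset_builder.as_dataset()

print(huggingface_dataset["train"].num_rows)  # should report 1000 once the split is built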

I replaced my dataset code with yours just now and it’s the same issue!

I finally figured it out! It turns out the argument name in my model’s forward method didn’t exactly match the dataset’s column name, so transformers decided the column was unnecessary for the Trainer and removed it from the dataset before feeding it to the model. Since "image" was my only column, the dataset ended up with zero usable examples, which explains both the "Num examples = 0" line and the IndexError. While this behaviour might be acceptable, the default Hugging Face logging level doesn’t print the message about the removed columns, which I think is a design flaw. Anyway, I got a clue from the tokenizer documentation, which briefly mentions this behaviour.
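For anyone who hits this later, here is a minimal sketch of the fix that worked for me plus an alternative from the TrainingArguments docs (the AutoEncoder class and its forward signature below are illustrative, not my actual model):

import torch
import transformers
from transformers import TrainingArguments

# Raising the verbosity should surface the message about removed columns
transformers.logging.set_verbosity_info()

# Option 1: name the forward argument after the dataset column,
# so the Trainer treats the "image" column as used
class AutoEncoder(torch.nn.Module):
    def forward(self, image):  # must match the dataset column name exactly
        ...

# Option 2: tell the Trainer not to drop columns that are missing
# from the model's forward signature
training_args = TrainingArguments(
    output_dir="outputs",
    remove_unused_columns=False,
)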