Huggingface Vision Dataset - the right way to use it?

Hey guys! I wanted to create a vision dataset of my own with Hugging Face Datasets by extending GeneratorBasedBuilder, like so:

from typing import AnyStr

import cv2
import datasets
import h5py
from datasets import DatasetInfo, DownloadManager


class AutoEncoderDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def __init__(self, training_pickle: AnyStr, validation_pickle: AnyStr, *args, writer_batch_size=None, **kwargs):
        super().__init__(*args, writer_batch_size=writer_batch_size, **kwargs)
        # Paths to the H5 files that hold the training and validation images
        self.training_pickle = training_pickle
        self.validation_pickle = validation_pickle

    def _info(self) -> DatasetInfo:
        # Each example consists of a single image
        features = datasets.Features({
            "image": datasets.Image()
        })

        return datasets.DatasetInfo(
            features=features
        )

    def _split_generators(self, dl_manager: DownloadManager):
        splits = [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "h5_path": self.training_pickle
                }
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "h5_path": self.validation_pickle
                }
            )
        ]

        return splits

    def _generate_examples(self, h5_path: AnyStr):
        with h5py.File(h5_path, "r") as infile:
            images = infile["images"]

            for _id in range(images.shape[0]):
                # Decode the stored bytes into an image array
                yield _id, {
                    "image": cv2.imdecode(images[_id][-1], cv2.IMREAD_COLOR)
                }

But when I pass this dataset to the Trainer, I get the following error:

***** Running training *****
  Num examples = 0
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 320
  0%|          | 0/320 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\nisha\Documents\Imagine\main.py", line 60, in <module>
    main()
  File "C:\Users\nisha\Documents\Imagine\main.py", line 52, in main
    atrain(args.dataset, args.subdatasets)
  File "C:\Users\nisha\Documents\Imagine\autoencoder_trainer.py", line 83, in train
    train_single_asset(subdataset, dataset_tag)
  File "C:\Users\nisha\Documents\Imagine\autoencoder_trainer.py", line 67, in train_single_asset
    trainer.train()
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\transformers\trainer.py", line 1339, in train
    for step, inputs in enumerate(epoch_iterator):
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\arrow_dataset.py", line 1764, in __getitem__
    return self._getitem(
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\arrow_dataset.py", line 1748, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\formatting\formatting.py", line 486, in query_table
    _check_valid_index_key(key, size)
  File "C:\Users\nisha\.conda\envs\imagine\lib\site-packages\datasets\formatting\formatting.py", line 429, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 664 is out of bounds for size 0

Process finished with exit code 1

I can’t figure out what is wrong. I looped through the dataset object entirely and there were no errors. Any help would be really appreciated! I also noticed that the number of examples is 0 for some reason.

Hi!

I also noticed that the number of examples is 0 for some reason.

Yes, that’s the problem.

Are you sure that the H5 file is not empty? What does images.shape[0] return?
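If you want to rule the file out, a quick standalone check along these lines should do it (a sketch; the path and the "images" key are assumptions based on your script):

import h5py

# Open the H5 file directly, outside the builder, and report its size
with h5py.File("path/to/images.h5", "r") as infile:  # replace with your actual path
    images = infile["images"]
    print(images.shape)  # the first dimension should be the number of records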

I have double-checked my H5 file and it contains 1000 records. I also added a print(images.shape[0]) to the script and it prints 1000 to the console. I am also able to loop through the dataset using this bit of code:

from core.dataset.AutoEncoderDataset import AutoEncoderDataset, normalize_and_resize
from datasets import DownloadMode

# Paths to the train/val H5 files
h5py_file = "./datasets/all/assets_2/{train_val}/images.h5"
dataset_builder = AutoEncoderDataset(h5py_file.format(train_val="train"), h5py_file.format(train_val="val"))
# Force a rebuild so a stale cached version doesn't mask the problem
dataset_builder.download_and_prepare(download_mode=DownloadMode.FORCE_REDOWNLOAD)
huggingface_dataset = dataset_builder.as_dataset()

for i in huggingface_dataset["train"]:
    print(i)

One peculiar thing I noticed is that even though I yield both the _id and the dictionary with the image in it, only the dictionary gets printed. I don’t know if this is expected behaviour or not.

{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=256x256 at 0x2A3960F5790>}

Yes, that’s expected. The _id field is required for legacy reasons, but other than that it’s not important.

I think there is a problem with the __init__ signature in your script. Can you update the script as follows:

from typing import AnyStr

import cv2
import datasets
import h5py
from datasets import DatasetInfo, DownloadManager


class AutoEncoderDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def _info(self) -> DatasetInfo:
        features = datasets.Features({
            "image": datasets.Image()
        })

        return datasets.DatasetInfo(
            features=features
        )

    def _split_generators(self, dl_manager: DownloadManager):
        # Hardcode the split paths here instead of passing them through __init__
        splits = [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "h5_path": "./datasets/all/assets_2/train/images.h5",
                }
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "h5_path": "./datasets/all/assets_2/val/images.h5",
                }
            )
        ]

        return splits

    def _generate_examples(self, h5_path: AnyStr):
        with h5py.File(h5_path, "r") as infile:
            images = infile["images"]

            for _id in range(images.shape[0]):
                yield _id, {
                    "image": cv2.imdecode(images[_id][-1], cv2.IMREAD_COLOR)
                }

and let us know if that helps?
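With __init__ gone, the builder takes no custom arguments, so loading it would look roughly like this (a sketch based on your earlier snippet):

from datasets import DownloadMode

# No constructor arguments needed now that the paths are hardcoded in _split_generators
dataset_builder = AutoEncoderDataset()
dataset_builder.download_and_prepare(download_mode=DownloadMode.FORCE_REDOWNLOAD)
huggingface_dataset = dataset_builder.as_dataset()

print(huggingface_dataset["train"].num_rows)  # should report 1000 once the split is built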

I replaced my dataset code with yours just now and it’s the same issue!

I finally figured it out! It turns out the argument name in my model’s forward method didn’t exactly match the dataset’s column name, so transformers decided the column was unnecessary for the Trainer and removed it from the dataset before feeding it to the model. Since "image" was my only column, the dataset ended up with zero usable examples, which explains both the "Num examples = 0" line and the IndexError. While this behaviour might be acceptable, the default Hugging Face logging level doesn’t print the message about the removed columns, which I think is a design flaw. Anyway, I got a clue from the tokenizer documentation, which briefly mentions this behaviour.
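For anyone who hits this later, here is a minimal sketch of the fix that worked for me plus an alternative from the TrainingArguments docs (the AutoEncoder class and its forward signature below are illustrative, not my actual model):

import torch
import transformers
from transformers import TrainingArguments

# Raising the verbosity should surface the message about removed columns
transformers.logging.set_verbosity_info()

# Option 1: name the forward argument after the dataset column,
# so the Trainer treats the "image" column as used
class AutoEncoder(torch.nn.Module):
    def forward(self, image):  # must match the dataset column name exactly
        ...

# Option 2: tell the Trainer not to drop columns that are missing
# from the model's forward signature
training_args = TrainingArguments(
    output_dir="outputs",
    remove_unused_columns=False,
)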