Missing dataset when following tutorials

seand0101 · November 19, 2024, 2:54am

I was following this tutorial, expecting everything to go smoothly as a way to get started. I found that some library functions were deprecated, but luckily that was easy to solve. Then I hit this error, which indicates that the ADE20k dataset in the cache is incomplete.

I reached the trainer.train() part, and it ran for about 3-4 iterations before this error occurred:


FileNotFoundError: [Errno 2] No such file or directory: 'C:..\.cache\\huggingface\\datasets\\downloads\\extracted\\b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa\\ADEChallengeData2016\\annotations\\training\\ADE_train_00000016.png'

Does anyone know where to start looking for the problem and why this happens? I ran the following:

from datasets import load_dataset

ds = load_dataset("scene_parse_150", split="train[:50]")

ds = ds.train_test_split(test_size=0.2)

and it works fine: the image displays correctly, which shows that it exists.

John6666 · November 19, 2024, 7:07am

I left it alone in Colab for about 2 hours, but no errors appeared. The training didn’t finish either…
If huggingface_hub is the latest version, you’ll get an error due to the deprecated functions.

!pip install transformers datasets huggingface_hub==0.25.2

seand0101 · November 19, 2024, 7:32am

So the problem is not about the missing dataset? Which deprecated functions? Btw my huggingface_hub is on 0.26.2 so it’s definitely newer? Also, does huggingface_hub somehow affect how the model is downloaded? Thanks in advance!

conda list huggingface_hub
# packages in environment at C:\Users\Lenovo\miniconda3\envs\hf-pretrain38:
#
# Name                    Version                   Build  Channel
huggingface_hub           0.26.2             pyh0610db2_0    conda-forge

John6666 · November 19, 2024, 7:48am

The cached_download() function in the Colab version is a deprecated function.

from huggingface_hub import cached_download, hf_hub_url

0.26.2 so it’s definitely newer

Yes. And downloading models is one of the main functions of huggingface_hub. For users, there are also functions such as searching and operating repositories.
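
For reference, the replacement that huggingface_hub documents for cached_download is hf_hub_download. A minimal sketch of loading the same label file that way, assuming an installed huggingface_hub that provides hf_hub_download; the helper names fetch_label_file and load_id2label are made up for illustration and are not part of the tutorial:

```python
import json


def fetch_label_file(repo_id="huggingface/label-files", filename="ade20k-id2label.json"):
    """Download the label file from the Hub and return its local cache path."""
    # Imported lazily so the rest of this sketch works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")


def load_id2label(path):
    """Read a {"0": "wall", ...} JSON file and build both label mappings."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    id2label = {int(k): v for k, v in raw.items()}
    label2id = {v: k for k, v in id2label.items()}
    return id2label, label2id


# Usage (needs network access):
# id2label, label2id = load_id2label(fetch_label_file())
# num_labels = len(id2label)
```

Keeping the download and the parsing in separate functions means the parsing part can be checked against any local JSON file first.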

seand0101 · November 19, 2024, 8:03am

I lowered the version to 0.25.2 and still get the same error. Should I try it in Google Colab or Jupyter Notebook? I was running it in Visual Studio Code.

Is there a workaround for the deprecated functions?

John6666 · November 19, 2024, 8:06am

Apparently the only outdated part is the Colab notebook code, so there should be no problem if you mostly follow the code on the sample page.
I used the Colab torch version as a base and supplemented the missing parts from the sample.
I realized later that the Colab version and the web version are different…

I’m not familiar with Colab, so I don’t know which one you’re using, but I guess it’s VSCode because the code isn’t divided into cells?

If the HF sample dataset is correct but you’re still getting dataset errors, I think the version of the datasets or transformers library in your environment differs from the version HF expects.
Even the pip releases of these libraries are updated every few weeks, so that’s just how it goes sometimes.

seand0101 · November 19, 2024, 8:13am

Dividing the code into cells only lets you run each part on its own, as Colab and Jupyter Notebook do; as far as I know, it doesn’t affect the input or output of the program.

Can I ask what you mean by “Colab notebook code” here? Is it the example from the tutorial?

Could you elaborate on “used the Colab torch version as a base and supplemented the missing parts from the sample”? Is it because the tutorial doesn’t include a requirements.txt?

So what should I do to work around this dependency problem with the HF sample dataset, other than rerunning the code? Is there anything that involves updating the dataset labels or something?

Thanks in advance.

John6666 · November 19, 2024, 8:17am

Oh, if that’s the case, I don’t know which one it is. Anyway, it’s the default one.
I found out while searching earlier that, in the Colab environment, the pip contents are not overwritten unless we explicitly uninstall the library…

I used this as a base and cut and pasted the missing parts as I saw fit. I’m going to check out Colab now.
https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/semantic_segmentation.ipynb

I deleted the save section because it wasn’t needed for the test.

!pip install transformers datasets huggingface_hub==0.25.2

from datasets import load_dataset

ds = load_dataset("scene_parse_150", split="train[:50]", trust_remote_code=True)

ds = ds.train_test_split(test_size=0.2)
train_ds = ds["train"]
test_ds = ds["test"]

train_ds[0]

import json
from huggingface_hub import cached_download, hf_hub_url

repo_id = "huggingface/label-files"
filename = "ade20k-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename, repo_type="dataset")), "r"))
id2label = {int(k): v for k, v in id2label.items()}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)

from transformers import AutoImageProcessor

checkpoint = "nvidia/mit-b0"
image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True)

from torchvision.transforms import ColorJitter

jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1)

def train_transforms(example_batch):
    images = [jitter(x) for x in example_batch["image"]]
    labels = [x for x in example_batch["annotation"]]
    inputs = image_processor(images, labels)
    return inputs


def val_transforms(example_batch):
    images = [x for x in example_batch["image"]]
    labels = [x for x in example_batch["annotation"]]
    inputs = image_processor(images, labels)
    return inputs

train_ds.set_transform(train_transforms)
test_ds.set_transform(val_transforms)

import evaluate

metric = evaluate.load("mean_iou")

import numpy as np

import torch

from torch import nn

def compute_metrics(eval_pred):
    with torch.no_grad():
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        pred_labels = logits_tensor.detach().cpu().numpy()
        metrics = metric.compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_labels,
            ignore_index=255,
            reduce_labels=False,
        )
        for key, value in metrics.items():
            if type(value) is np.ndarray:
                metrics[key] = value.tolist()
        return metrics

from transformers import AutoModelForSemanticSegmentation, TrainingArguments, Trainer

model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint, id2label=id2label, label2id=label2id)

training_args = TrainingArguments(
    output_dir="segformer-b0-scene-parse-150",
    learning_rate=6e-5,
    num_train_epochs=50,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_total_limit=3,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=20,
    eval_steps=20,
    logging_steps=1,
    eval_accumulation_steps=5,
    remove_unused_columns=False,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

trainer.train()

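# ds is a DatasetDict after train_test_split (so ds[0] fails), and set_transform
# makes the splits return processed tensors, so reload a raw split for inference.
raw_ds = load_dataset("scene_parse_150", split="train[:50]", trust_remote_code=True)
image = raw_ds[0]["image"]
image

from transformers import pipeline

# "my_awesome_seg_model" is the tutorial's placeholder name; it only resolves
# if a model was saved or pushed under that name.
segmenter = pipeline("image-segmentation", model="my_awesome_seg_model")
segmenter(image)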

seand0101 · November 19, 2024, 2:00pm

It ran for a while and then stopped at the same error. I did find warnings that several of these functions are going to be deprecated, but only when I run everything together as one block of code like you just wrote:


1. FutureWarning: 'cached_download' (from 'huggingface_hub.file_download') is deprecated and will be removed from version '0.26'. Use `hf_hub_download` instead.
  warnings.warn(warning_message, FutureWarning)

2. FutureWarning: 'cached_download' is the legacy way to download files from the HF hub, please consider upgrading to 'hf_hub_download'
  warnings.warn(

3. FutureWarning: 'url_to_filename' (from 'huggingface_hub.file_download') is deprecated and will be removed from version '0.26'. Use `hf_hub_download` to benefit from the new cache layout.
  warnings.warn(warning_message, FutureWarning)

If I run them separately, the only warning is:

FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of Transformers. Use `eval_strategy` instead
  warnings.warn(
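
The warning above names eval_strategy as the replacement for evaluation_strategy. A small sketch of picking whichever keyword the installed Transformers accepts; eval_kwarg_name and build_training_kwargs are hypothetical helpers, and inspecting the constructor signature is just one assumed way to stay compatible across versions, not something the tutorial does:

```python
def eval_kwarg_name(parameter_names):
    """Return the evaluation-strategy keyword found in the given parameter names."""
    # Newer Transformers releases accept `eval_strategy`; older ones only know `evaluation_strategy`.
    return "eval_strategy" if "eval_strategy" in parameter_names else "evaluation_strategy"


def build_training_kwargs(parameter_names, strategy="steps"):
    """Build the evaluation-related kwargs using the step counts from the thread."""
    return {eval_kwarg_name(parameter_names): strategy, "eval_steps": 20, "save_steps": 20}


# Usage with the real class (requires transformers to be installed):
# import inspect
# from transformers import TrainingArguments
# names = inspect.signature(TrainingArguments.__init__).parameters
# args = TrainingArguments(output_dir="segformer-b0-scene-parse-150", **build_training_kwargs(names))
```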

Then it came back to the same error; let me copy it for comparison:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Lenovo\.cache\huggingface\datasets\downloads\extracted\b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa\ADEChallengeData2016\annotations\training\ADE_train_00000024.png'

I’m debugging it as best I can… hmm. It still has to be something to do with the data download. Let me check the documentation first.

John6666 · November 19, 2024, 3:08pm

Warnings are hints, and you can usually ignore them. The error is still a concern.
The question is why it tried to access that file in the first place. It is also possible that the cache is simply corrupted.
On a completely different note, I have seen cases where something did not work properly in Colab but ran without problems locally or in HF Spaces. To me, it looked like the HF cache was malfunctioning in Colab.
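
If the cache is the suspect, one thing to try is asking datasets to ignore it and download again. A hedged sketch: download_mode="force_redownload" is a documented load_dataset option, but whether it resolves this particular error is untested here, and reload_kwargs and reload_dataset are only illustrative helpers:

```python
def reload_kwargs(split="train[:50]", force=True):
    """Keyword arguments for load_dataset that optionally bypass the local cache."""
    kwargs = {"split": split, "trust_remote_code": True}
    if force:
        # Download and extract again instead of reusing the cached files.
        kwargs["download_mode"] = "force_redownload"
    return kwargs


def reload_dataset(name="scene_parse_150", **overrides):
    """Load the dataset again, discarding cached downloads by default."""
    # Imported lazily so reload_kwargs can be inspected without datasets installed.
    from datasets import load_dataset

    return load_dataset(name, **reload_kwargs(**overrides))


# Usage (downloads the archive again, so it can take a while):
# ds = reload_dataset()
```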

seand0101 · November 20, 2024, 1:36am

Are the ADE20k files really only related to the huggingface_hub part?

John6666 · November 20, 2024, 2:42am

Are the ADE20k files really only related to the huggingface_hub part?

I hadn’t thought of that, but it seems that’s the correct answer. The script is scraping manually!
The only thing in the json is the index.
I think this is probably the cause…

seand0101 · November 20, 2024, 7:05am

There are lots of JSON files in .cache\huggingface\datasets\downloads\ like this:

4b7e15ecb751b44e2d66e32c3e7e02e4b37bb44e1abc932d9d00922ca665982a.c603ef2b6cf2bbebe908de7d918145ee128d003c6058bc6bab205a43c7dc18c0.py

4b7e15ecb751b44e2d66e32c3e7e02e4b37bb44e1abc932d9d00922ca665982a.c603ef2b6cf2bbebe908de7d918145ee128d003c6058bc6bab205a43c7dc18c0.py.json

79e27561b3c4da91e4249a0894b9a8408ed7508080d89f2c13a486a0ad29e7c2
79e27561b3c4da91e4249a0894b9a8408ed7508080d89f2c13a486a0ad29e7c2.json

a20af833c634c197b032a1337a371e7c741c0bc15c4a4ee719d51d9f448460fb
a20af833c634c197b032a1337a371e7c741c0bc15c4a4ee719d51d9f448460fb.json

a3dcca2a336ca35407db7364e360ef0bf50f9f2848a0b2230b9bde8448a4ea0a
a3dcca2a336ca35407db7364e360ef0bf50f9f2848a0b2230b9bde8448a4ea0a.json

afab7c85dc2a7cc7d0ab0bf2926d3c64cce46c5e4d05a1c3669515efcb2124bf.ee7442fd1e36f82b5f211a315d36608250a79865dd9b645657df2d770c4972b3
afab7c85dc2a7cc7d0ab0bf2926d3c64cce46c5e4d05a1c3669515efcb2124bf.ee7442fd1e36f82b5f211a315d36608250a79865dd9b645657df2d770c4972b3.json

d888d057846ead63005c95e175267409dd51a698218ebee0edf5ab216b133dea
d888d057846ead63005c95e175267409dd51a698218ebee0edf5ab216b133dea.json
extracted

fede8a3bc39f7c1f88af1b9eff20e181e04da5d9bd0fd76b428f218fe993c1d9.98f128f6a35a2e26f136376f175a86e07f91579870bd33f90a037b77f538ef5b
fede8a3bc39f7c1f88af1b9eff20e181e04da5d9bd0fd76b428f218fe993c1d9.98f128f6a35a2e26f136376f175a86e07f91579870bd33f90a037b77f538ef5b.json

It looks like they come in pairs: the .json and the Python source file you just linked. Which one should I change and rerun again to make sure all the files are downloaded correctly?

I ask because I checked, and there’s only one extracted folder, with a tree like this:

> C:.
> └───extracted
>     └───b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa
>         └───ADEChallengeData2016
>             ├───annotations
>             │   ├───training
>             │   └───validation
>             └───images
>                 ├───training
>                 └───validation

with missing images, do you think the code didn’t read all the .json and python code for downloading the datasets? Sorry this has become so messy. I really appreciate all the help I can get, because I cannot find anything to read about this. The ADE20k documentation is also poor, and so far I have not been able to download directly from their website.

By the way, is that SageMaker notebook yours, or do they have code demonstrating segmentation somewhere?

John6666 · November 20, 2024, 7:17am

Which one should I change and rerun again to make sure all the files are downloaded correctly?

Maybe this one. The other one seems to be just a list of keywords.

with missing images, do you think the code didn’t read all the .json and python code for downloading the datasets?

If that’s the case, it would have crashed at an earlier stage, or HF would have noticed when they made this sample in the first place…
But the most likely scenario in this sample is that the script is not working as expected. For example, if there have been changes on the other site, or if some files have been missed. Of course, it’s not 100%.
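
One way to tell whether files were missed is to compare the extracted folders directly. A minimal sketch, assuming the layout shown in the tree earlier in the thread (images/training and annotations/training) and assuming each .jpg image should have a .png annotation with the same stem, as the file name in the error suggests; find_missing_annotations is a made-up helper:

```python
from pathlib import Path


def find_missing_annotations(root, split="training"):
    """Return image stems under images/<split> that have no PNG under annotations/<split>."""
    root = Path(root)
    image_stems = {p.stem for p in (root / "images" / split).glob("*.jpg")}
    annotation_stems = {p.stem for p in (root / "annotations" / split).glob("*.png")}
    return sorted(image_stems - annotation_stems)


# Usage: point `root` at the extracted ADEChallengeData2016 folder in your own cache.
# missing = find_missing_annotations(r"...\extracted\<hash>\ADEChallengeData2016")
# print(len(missing), missing[:5])
```

An empty result would suggest the extraction is complete and the problem lies elsewhere; a non-empty one points at an incomplete download or extraction.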

seand0101 · November 20, 2024, 7:23am

Can I just… you know… use my own dataset instead of this one? Do you know how to plug mine into the code? Do I need to write my own dataset class, or can I just import it as usual?

I actually have my own images for training. I was going to try this with theirs first to understand how it works, but I just fell into a deeper dungeon of debugging their deprecated code.

Update:
Found a way to make datasets from here

John6666 · November 20, 2024, 7:46am

Oh, we don’t have to debug it. Of course, I’m not a staff member either.
Let’s use a usable dataset!
So, the way to use your own dataset is to upload the images in folders, and you’re almost done.
The folder names will become the labels.
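
As a rough illustration of the folder-per-label layout, the sketch below walks a directory and pairs each image with its parent folder name. The imagefolder loader in the datasets library follows the same convention; the helper scan_labeled_folders and the example path are hypothetical. Note that this yields one label per image (classification), so segmentation masks would still have to be supplied separately:

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def scan_labeled_folders(data_dir):
    """Return sorted (relative_path, label) pairs, using each image's parent folder as its label."""
    data_dir = Path(data_dir)
    pairs = []
    for path in data_dir.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            pairs.append((path.relative_to(data_dir).as_posix(), path.parent.name))
    return sorted(pairs)


# With the datasets library installed, the same layout can be loaded directly:
# from datasets import load_dataset
# ds = load_dataset("imagefolder", data_dir="path/to/my_images")
```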

seand0101 · November 20, 2024, 7:48am

Alright, at least we both found the same method, which they created. I’ll accept this as the answer for now. Thanks, John6666.

seand0101 · November 20, 2024, 8:02am

The program failed because my data was unlabelled. From what I know, I need to use another AI to label my raw data; is that correct?

John6666 · November 20, 2024, 8:22am

Even for labeling, if it’s something simple like cats or dogs, the images can be classified with simple image classification models.
If you can do it manually, that’s fine too.
For detailed captioning, such as for training image generation AI models, you could use models like the ones used in spaces like the one below.

seand0101 · November 20, 2024, 2:59pm

If it’s done manually, it means I write my own, uhh… what’s the format for the labels again, JSON? I need to look up which of these files are the labels.

Ah okay, so what we’re trying to do is pretrain with a previously labeled dataset so I don’t have to label it myself, right? But even so, the labels are not what I expected: the image is supposed to show a river, but the results only contain things like cars or sky. Is that the purpose of the link we’re following?

the details are like this

the code
from transformers import pipeline

panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-panoptic")
results3 = panoptic_segmentation(image)
results3

results2_river = panoptic_segmentation(image_river)
results2_river

the output

[{'score': 0.986528,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.907122,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.976372,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.991359,
  'label': 'fence',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.999967,
  'label': 'vegetation',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.964172,
  'label': 'pole',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.902589,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.999337,
  'label': 'building',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.939224,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.994364,
  'label': 'wall',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.97558,
  'label': 'road',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.973715,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>},
 {'score': 0.999913,
  'label': 'sky',
  'mask': <PIL.Image.Image image mode=L size=1420x1080>}]
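
To see at a glance which classes came back, the list of dictionaries above can be tallied by label. A small sketch in plain Python; count_labels is an illustrative helper, and the sample reuses a few entries from the output with the mask field omitted:

```python
from collections import Counter


def count_labels(results, min_score=0.0):
    """Count how many segments each label has, ignoring those below min_score."""
    return Counter(r["label"] for r in results if r["score"] >= min_score)


sample = [
    {"score": 0.986528, "label": "car"},
    {"score": 0.907122, "label": "car"},
    {"score": 0.991359, "label": "fence"},
    {"score": 0.999913, "label": "sky"},
]
counts = count_labels(sample)
print(counts.most_common())
print("river" in counts)
```

If an expected class never shows up, it may be worth checking the label list of the checkpoint itself (for example panoptic_segmentation.model.config.id2label) before assuming the model can predict it.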
