How to clean 8217 pictures by removing similar ones

@John6666 Dear John and other huggingface community members,

Do you know of a tool to quickly remove near-duplicate images and label the remaining ones for dataset purposes? Thanks in advance.


I know how to automatically tag images in detail to a certain extent, but I don’t know how to process large image datasets concisely, so I searched around for a solution.
It seems there are several specialized tools.
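Before reaching for a dedicated tool, it may be worth a quick pass for byte-identical copies, which a plain file hash already catches. A minimal stdlib sketch (the folder layout and extension list are assumptions):

```python
import hashlib
from pathlib import Path


def find_exact_duplicates(folder: str, exts=(".jpg", ".jpeg", ".png")):
    """Group byte-identical image files under `folder` by SHA-256 digest.

    Returns a list of groups, each group holding two or more paths
    whose file contents are exactly the same.
    """
    groups: dict[str, list[Path]] = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in exts:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups.setdefault(digest, []).append(path)
    # Only groups with more than one file are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Note this only finds exact duplicates; near-duplicates (resized or re-encoded images) need perceptual hashing (e.g. the imagehash library) or embedding similarity, which is what the specialized tools do.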

Exactly what I need, which for some reason didn’t appear in my search engine. I’ll try this and reply if I run into difficulties. Thanks as always, John6666.


Is SafeSearch turned on, or are your personal settings being applied? I searched using Bing.

Then it’s probably Google that has gotten worse these days, or I was just unlucky. I almost want to clean this data manually.


That’s right. Google search used to be amazing…

Even now, Google is the most powerful when it comes to setting detailed search options (like site:huggingface.co).
I usually just use Bing or DuckDuckGo, but I sometimes use this when I’m in a bind: https://searx.bndkt.io/ If even that doesn’t work, there are also paid services out there, but I don’t have anything I want to search for that badly…

One more thing, though: which part of this is the HF dataset identifier mentioned in this code?

import json
from huggingface_hub import hf_hub_download

# hf_dataset_identifier is the dataset's repo id on the Hub, e.g. "user/dataset-name";
# repo_type="dataset" already tells the Hub to look in the datasets namespace.
filename = "id2label.json"
with open(hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset"), "r") as f:
    id2label = json.load(f)
id2label = {int(k): v for k, v in id2label.items()}
label2id = {v: k for k, v in id2label.items()}

num_labels = len(id2label)

in this tutorial: Fine-Tune a Semantic Segmentation Model with a Custom Dataset

Somehow I’m an expert at making things in Hugging Face tutorials not work. Is it because that dataset doesn’t have the labels file? Thanks in advance.


GATED AGAIN…

Don’t worry, I don’t want to use that gated dataset… is Cityscapes usable?


nvidia/segformer-b5-finetuned-cityscapes-1024-1024 is a model, not a dataset… So I guess you can’t load it with the datasets library?

Edit:
is this the dataset?

What is the difference, though? And what should I do if I want to take a bigger pretrained model and train it on a smaller dataset to “fine-tune” it? I think I’m confused about this.


What is the difference though

In the extreme case, the only difference is whether you choose a dataset or a model when creating a new repo on HF. You can basically put anything in either.

However, it is usually more convenient to put models (fine-tuned or not) in model repos and datasets in dataset repos, so that’s what most people do.

I think the datasets library is designed to load from dataset repos, so you need to find a dataset repo. Probably.

But can I fine-tune with my own dataset using either of them, or only with models? Is that what’s called pretraining?


Of course you can. Basically, it’s a matter of training a model on a dataset, saving it to a new model repo, and repeating the process. You can also overwrite the existing repo instead of creating a new one, but either way, I think that’s the workflow HF assumes.
So you should be able to train further using nvidia’s model as a base and save the result for your own use.
I don’t really know the precise definition of pretraining…
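That train-then-save loop might look roughly like this, assuming the transformers Trainer API (the repo ids, output directory, and dataset argument are all placeholders; imports are deferred into the function so the sketch can be defined without transformers installed):

```python
def continue_finetuning(base_model_id, out_repo_id, num_labels, id2label,
                        train_dataset=None):
    """Sketch: load a pretrained checkpoint, train it further on your own
    dataset, and push the result to your own model repo on the Hub.

    base_model_id could be "nvidia/segformer-b5-finetuned-cityscapes-1024-1024";
    out_repo_id is a hypothetical repo name like "your-username/my-segformer".
    """
    from transformers import (SegformerForSemanticSegmentation, Trainer,
                              TrainingArguments)

    # Load nvidia's checkpoint as the base and resize the classification
    # head to your own label set (it was trained on Cityscapes classes).
    model = SegformerForSemanticSegmentation.from_pretrained(
        base_model_id,
        num_labels=num_labels,
        id2label=id2label,
        label2id={v: k for k, v in id2label.items()},
        ignore_mismatched_sizes=True,  # head shape differs from the base
    )

    args = TrainingArguments(
        output_dir="segformer-finetuned",  # local checkpoint directory
        push_to_hub=True,
        hub_model_id=out_repo_id,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()
    trainer.push_to_hub()  # saves the further-trained model to your repo
```

Repeating the cycle is then just calling this again with the previous `out_repo_id` as the new `base_model_id`.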

segments.ai is way too slow. Does anyone know a guide to fine-tuning with unlabeled images?


I don’t know why I edited my previous message to ask this question. I’ll make it a reply instead.

segments.ai is way too slow. Does anyone know a guide to fine-tuning with unlabeled images?

Whether this is possible depends on what you want to train the model for.
For example, if you train the model for a classification task, you need to tell it what it is classifying…
You could use the image folder names as labels, but doing it with nothing at all would be difficult…
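The folder-name idea can be sketched like this (the layout root/<class>/<image> is an assumption; the datasets library's "imagefolder" loader does essentially the same thing automatically):

```python
from pathlib import Path


def labels_from_folders(root: str):
    """Derive classification labels from immediate subfolder names.

    Expects a layout like root/cat/001.jpg, root/dog/002.jpg, ...
    Returns (samples, id2label), where samples pairs each image path
    with an integer label id, and id2label maps id -> class name.
    """
    class_names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    label2id = {name: i for i, name in enumerate(class_names)}
    samples = []
    for name in class_names:
        for img in sorted((Path(root) / name).glob("*")):
            if img.is_file():
                samples.append((img, label2id[name]))
    id2label = {i: name for name, i in label2id.items()}
    return samples, id2label
```

This gives you the same id2label / label2id dictionaries the tutorial code loads from id2label.json, just inferred from the directory structure instead.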

I found guides that use image-processing masks instead of the classification-neural-network way of masking. I might want to try that; better than nothing…


Well, unless you have a specific end goal in mind, I think it’s a good idea to try out different things.
I’ve heard that the quality of a dataset and its labels has a direct impact on the efficiency and accuracy of model training.
But that’s something you can think about when you’re training the final model at the end. There’s not much point in polishing a model you’re only training as a trial…