@John6666 Dear John and other Hugging Face community members,
Do you know of a quick way to clean out near-duplicate images and label images for dataset purposes? Thanks in advance.
I know how to automatically tag images in detail to a certain extent, but I don’t know how to process large image data sets concisely, so I searched around for a solution.
It seems there are several specialized tools.
Exactly what I need, and for some reason it didn't appear in my search engine. I'll try this and reply if I run into difficulties. Thanks as always, John6666.
Is SafeSearch turned on, or are your personal settings being applied? I searched using Bing.
Then it's probably Google that sucks these days, or I'm just unlucky. I almost want to clean this data manually.
That’s right. Google search used to be amazing…
Even now, Google is the most powerful when it comes to setting detailed search options. (Like site:huggingface.co)
I usually just use Bing or DuckDuckGo, but I sometimes use this when I’m in a bind. https://searx.bndkt.io/ If even that doesn’t work, there are also paid services by someone out there, but I don’t have anything that I want to search for that much…
One more thing though: which part of this is the hf dataset identifier mentioned in this code
import json
from huggingface_hub import hf_hub_download

# hf_dataset_identifier is the "username/dataset-name" string of the dataset repo on the Hub,
# set earlier in the tutorial; repo_id below is built from it but isn't actually used afterwards.
repo_id = f"datasets/{hf_dataset_identifier}"
filename = "id2label.json"
id2label = json.load(
    open(hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset"), "r")
)
id2label = {int(k): v for k, v in id2label.items()}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)
from this tutorial: Fine-Tune a Semantic Segmentation Model with a Custom Dataset
Somehow I'm an expert at making things in Hugging Face tutorials not work. Is it because that dataset doesn't have the labels? Thanks in advance.
GATED AGAIN…
Don't worry, I don't want to use that gated dataset… is the Cityscapes one usable?
nvidia/segformer-b5-finetuned-cityscapes-1024-1024 is a model, not a dataset… So I guess you can't use it from the datasets library?
Edit: is the dataset this one?
What is the difference though, and what should I do if I want to take a bigger pretrained model and train it on a smaller dataset to "fine-tune" it? I think I got confused about it.
What is the difference though
In the extreme case, the only difference is whether you choose a dataset or a model when creating a new repo on HF. You can basically put anything in either.
However, it is usually more convenient to put models (fine-tuned or not) in a model repo and datasets in a dataset repo, so that's what most people do.
I think the datasets library is designed to use datasets in dataset repos, so you need to find one in a dataset repo. Probably.
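To illustrate the difference, roughly (the dataset identifier below is just a placeholder, not a specific repo I've checked): a dataset repo is loaded with the datasets library, and a model repo with transformers.
from datasets import load_dataset
from transformers import SegformerForSemanticSegmentation

# A dataset repo is loaded with the datasets library.
# "your-username/your-dataset" is a placeholder identifier.
ds = load_dataset("your-username/your-dataset")

# A model repo is loaded with transformers instead.
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"
)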
But can I fine-tune with my own dataset using either of them, or only with models? Is that what's called pretraining?
Of course you can. Basically, I think it's a matter of training a model with a dataset, saving it to a new model repo, and repeating the process. You can also overwrite it instead of creating a new one, but in any case, I think that's the method HF is assuming.
So you should be able to train it further using nvidia’s as a base model and save it for your own use.
I don’t really know the definition of pretraining…
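As a very rough, untested sketch of that flow: train_ds and valid_ds (your own prepared dataset) and the id2label / label2id from the earlier snippet are assumed to already exist, and the output repo name is a placeholder.
from transformers import (
    SegformerForSemanticSegmentation,
    Trainer,
    TrainingArguments,
)

# Start from nvidia's checkpoint as the base model, swapping in your own label set.
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-cityscapes-1024-1024",
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # the new head has a different number of labels
)

training_args = TrainingArguments(
    output_dir="my-segformer-finetuned",  # placeholder output / repo name
    learning_rate=6e-5,
    num_train_epochs=10,
    push_to_hub=True,  # save the result to your own model repo on the Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # your own labeled dataset, prepared beforehand
    eval_dataset=valid_ds,
)
trainer.train()
trainer.push_to_hub()
Once it's pushed, that repo can itself be used as the base for the next round of training, which is the "repeating the process" part.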
segments.ai is way too slow. Does anyone know of a guide to fine-tune with unlabelled images?
Whether or not this is possible will depend on what you train the model on.
For example, if you train the model for a classification task, you need to tell it what it is classifying…
You could use the image folder name as a label, but it would be difficult to do without anything…
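For example, with an image-folder layout the subfolder names can serve as the labels (the paths here are made up):
from datasets import load_dataset

# The "imagefolder" loader turns subfolder names into class labels,
# e.g. my_images/cat/001.png, my_images/dog/002.png -> labels "cat" and "dog".
ds = load_dataset("imagefolder", data_dir="my_images")
print(ds["train"].features["label"].names)  # labels inferred from the folder names
For segmentation you'd still need masks, though, so this mainly helps for classification-style labels.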
I found guides that let me use image-processing masks instead of the classification-neural-network way of masking. I might want to try that; better than nothing…
Well, unless you have a specific end goal in mind, I think it’s a good idea to try out different things.
I've heard that the quality of the dataset and its labels has a direct impact on the efficiency and accuracy of model training.
But that’s something you can think about when you’re training the main model at the end. There’s not much point in improving the performance of the model you’re training on a trial basis…