I recently encountered an interesting concern. Last weekend I participated in a small hackathon where one of the challenges was to identify rubber ducks on a course. Our robot had a Raspberry Pi 5 onboard, and we wanted to use YOLO to identify the locations of the ducks.
The pretrained “teddy bear” class was aggressive enough to fire on our ducks, so we changed its display name to “rubber duck” at inference time. This worked fine for our very limited use case, but we also wondered… could we improve this capability by fine-tuning the “teddy bear” class on actual rubber duck images? This led to adding bounding-box annotations to existing datasets and fine-tuning YOLOv8n. A rough sketch of both steps is below.
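For anyone curious, the rename trick is just a label swap at display time; here’s a minimal sketch assuming the Ultralytics Python API (file and dataset names are hypothetical):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # stock COCO-pretrained weights
results = model("course_frame.jpg")   # hypothetical frame from the robot camera

# Swap the label at display time only: the model still predicts the COCO
# "teddy bear" class, we just present it as "rubber duck".
display_names = {"teddy bear": "rubber duck"}

for box in results[0].boxes:
    name = results[0].names[int(box.cls)]
    label = display_names.get(name, name)
    print(f"{label}: conf={float(box.conf):.2f}, xyxy={box.xyxy[0].tolist()}")

# The fine-tuning step is then a one-liner once the duck annotations are in
# a YOLO-format dataset yaml (hypothetical file):
# model.train(data="rubber_ducks.yaml", epochs=50, imgsz=640)
```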
I thought it could be useful to publish our dataset and model, but while putting together the model card and setting our own license I realized we have a problem… the datasets in question don’t identify any license, and HF leaves that choice to the publisher. So it’s not clear to me whether we can reasonably use “public” datasets for our own purposes, train a model on them, publish our own updated dataset/model, etc…
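To make the sticking point concrete: on HF the license is just a field in the model card’s YAML front matter, and nothing upstream tells us what we can legitimately put there (the values below are placeholders):

```yaml
---
license: mit                # placeholder; this is exactly the field we can't fill confidently
datasets:
  - some-user/duck-images   # hypothetical upstream dataset with no declared license
---
```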
I’m posting here for a few reasons, most importantly to seek clarification on how such situations are expected to be handled, since I no longer think that “public” necessarily means usable, and it’s very interesting to me that this is how things are set up. Our use case is probably fairly obscure and not that interesting, but in the future we could do similar things that matter more.
My initial reaction was to pull the dataset and model, but after some exploration with an LLM it seemed that this could be an opportunity for learning.
The datasets in question:
Related dataset and model:
FYI @Norod78
I wasn’t able to tag HF Staff member linoyts since they don’t seem to have an account on the forum, but I did send them a LinkedIn message that may end up in spam. That’s Linoy Tsaban, in case anyone knows them or their handle here.