I recently encountered an interesting concern. Last weekend I participated in a small hackathon where one of the challenges was to identify rubber ducks on a course. Our robot had a Raspberry Pi 5 onboard, and we wanted to use YOLO to identify the locations of the ducks.
The pretrained “teddy bear” class was aggressive enough to fire on our ducks, so we changed its display name to “rubber duck” at inference time. This worked fine for our very limited use case, but we also wondered… could we improve this capability by fine-tuning the “teddy bear” class on actual rubber duck images? This led to adding bounding-box annotations to existing datasets and fine-tuning YOLOv8n. A rough sketch of both steps is below.
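For anyone curious, the rename trick is just a label swap at display time; here’s a minimal sketch assuming the Ultralytics Python API (file and dataset names are hypothetical):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # stock COCO-pretrained weights
results = model("course_frame.jpg")   # hypothetical frame from the robot camera

# Swap the label at display time only: the model still predicts the COCO
# "teddy bear" class, we just present it as "rubber duck".
display_names = {"teddy bear": "rubber duck"}

for box in results[0].boxes:
    name = results[0].names[int(box.cls)]
    label = display_names.get(name, name)
    print(f"{label}: conf={float(box.conf):.2f}, xyxy={box.xyxy[0].tolist()}")

# The fine-tuning step is then a one-liner once the duck annotations are in
# a YOLO-format dataset yaml (hypothetical file):
# model.train(data="rubber_ducks.yaml", epochs=50, imgsz=640)
```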
I thought it could be useful to publish our dataset and model, but while putting together the model card and setting our own license I realized we have a problem… the datasets in question don’t identify any license, and HF leaves that choice to the publisher. So it’s not clear to me whether we can reasonably use “public” datasets for our own purposes, train a model on them, publish our own updated dataset/model, etc…
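To make the sticking point concrete: on HF the license is just a field in the model card’s YAML front matter, and nothing upstream tells us what we can legitimately put there (the values below are placeholders):

```yaml
---
license: mit                # placeholder; this is exactly the field we can't fill confidently
datasets:
  - some-user/duck-images   # hypothetical upstream dataset with no declared license
---
```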
I’m posting here for a few reasons, most importantly to seek clarification on how such situations are expected to be handled, since I no longer think that “public” necessarily means usable, and it’s very interesting to me that this is how things are set up. Our use case is probably fairly obscure and not that interesting, but in the future we could do similar things that matter more.
My initial reaction was to pull the dataset and model, but after some exploration with an LLM it seemed that this could be an opportunity for learning.
The datasets in question:
Related dataset and model:
FYI @Norod78
I wasn’t able to tag HF Staff member linoyts since they don’t seem to have an account on the forum, but I did send them a LinkedIn message that may end up in spam. That’s Linoy Tsaban, in case anyone knows them or their handle here.