Uploading the clips to a private repository is fine, and I think using them for your own research is okay, but linking to them from a paper through a public or gated repository might be problematic.
What you’re trying to do
You have:
- 1,300 short clips that you created by trimming YouTube videos (serial-related).
- A plan to publish a CSV/Excel manifest (ID, duration, channel/publisher, serial name, labels for emotion analysis).
Your core question is whether you can share a link to the dataset in a paper, e.g., by uploading it to Hugging Face, and whether adding a request form / gated access would make it acceptable.
The key distinction (this decides most of the answer)
A) Publishing the actual clipped video files (MP4s / audio / frames)
In most cases, this is not allowed unless you have clear redistribution rights (permission/licenses) for the underlying content.
Why:
- YouTube’s Terms of Service restrict reproducing, downloading, distributing, or altering content except as expressly authorized, or with written permission from YouTube and (where applicable) the rights holders. (YouTube)
- If your clips contain TV/serial footage, the uploader often does not own the rights, so re-hosting those clips elsewhere is high-risk (copyright + takedowns).
B) Publishing a manifest + annotations (recommended)
This typically means publishing:
- YouTube video IDs / URLs
- timestamps for each clip (start/end, duration)
- channel info, serial name, and your emotion labels
- optionally non-reconstructive derived features (embeddings)
This is the standard pattern used by major research datasets that rely on YouTube content without redistributing the media itself (e.g., AudioSet releases CSV entries with YouTube ID + start/end time + labels). (research.google.com)
Kinetics similarly distributes URLs + temporal intervals. (arXiv)
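To make the "pointer" idea concrete, here is a minimal sketch of one manifest entry in Python. The class name `ClipPointer`, its field names, and the `label` value are hypothetical choices for illustration, not something AudioSet or Kinetics prescribe. The URL shapes rely on the `t` parameter of watch links and the `start`/`end` parameters of the embedded player as I understand them; treat them as assumptions and check them against current YouTube documentation.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class ClipPointer:
    """One manifest row: a reference into a YouTube video, not the media itself."""

    video_id: str   # YouTube video ID
    start_sec: int  # clip start, in whole seconds
    end_sec: int    # clip end, in whole seconds
    label: str      # hypothetical emotion label

    @property
    def duration_sec(self) -> int:
        """Clip length derived from the two timestamps."""
        return self.end_sec - self.start_sec

    def watch_url(self) -> str:
        """Link that opens the source video at the clip's start time."""
        query = urlencode({"v": self.video_id, "t": f"{self.start_sec}s"})
        return f"https://www.youtube.com/watch?{query}"

    def embed_url(self) -> str:
        """Embedded-player link limited to the clip interval (assumed parameters)."""
        query = urlencode({"start": self.start_sec, "end": self.end_sec})
        return f"https://www.youtube.com/embed/{self.video_id}?{query}"
```

Nothing here touches the video data: a reader of your dataset gets enough information to locate the segment on YouTube, and that is all.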
Can you upload it to Hugging Face and link it in a paper?
If you upload only the CSV/Excel manifest + labels
Generally yes (and it is common), assuming you’re not including copyrighted media files.
Hugging Face is a hosting platform; it also has an IP takedown process (DMCA-style), so you should publish only what you have the right to distribute. (Hugging Face)
For academic citation stability, Hugging Face supports minting a DOI for dataset repos. (Hugging Face)
If you upload the clipped videos
The upload is exposed to removal/takedown (and possibly other consequences) if the content is infringing. Hugging Face explicitly provides a process for reporting IP infringement and can remove content. (Hugging Face)
Does a “request form” / gated access make uploading clips OK?
No—gating is not a legal shield.
What gating does:
- Hugging Face “gated datasets” require users to request access and share contact info; you can add extra fields (a “form”). (Hugging Face)
What gating does not do:
- It does not grant you redistribution rights for copyrighted clips.
- If the clips infringe copyright or violate platform terms, the dataset can still be taken down. (Hugging Face)
Operational pitfall:
- Access requests are designed to be handled through a browser workflow, which is not ideal for fully automated access. (Hugging Face Forums)
Background: Why “download + reupload clips” is the risky part
Even if your intent is academic:
- YouTube’s Terms limit downloading and redistribution except as explicitly authorized. (YouTube)
- YouTube’s API policies also explicitly prohibit downloading/caching/storing copies of YouTube audiovisual content without prior written approval. (Google for Developers)
Separately, copyright law (fair use / quotation / research exceptions) varies by country and is fact-specific. Even if a legal exception applies in some contexts, that doesn’t automatically make public redistribution of clips (hosting a clip archive) safe.
A practical, low-risk plan for your dataset (what to publish)
1) Publish a “pointer dataset” (public on Hugging Face)
Upload only:
- clip_id
- youtube_video_id and/or youtube_url
- start_sec, end_sec, duration_sec
- channel_id, channel_title (publisher)
- serial_name
- emotion labels (and annotation metadata: label set, guidelines version, annotator count, agreement if you have it)
- collected_at, and optionally last_checked_at + availability_status
This aligns with common precedents (AudioSet, Kinetics). (research.google.com)
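As a rough sketch of how such a manifest could be written and sanity-checked before upload, the snippet below uses only the Python standard library. The column list mirrors the fields above, with a single hypothetical `emotion_label` column standing in for your labels. The 11-character video ID pattern is an assumption based on how IDs commonly look, not a documented guarantee, so relax it if your data disagrees.

```python
import csv
import io
import re

FIELDS = [
    "clip_id", "youtube_video_id", "start_sec", "end_sec", "duration_sec",
    "channel_id", "channel_title", "serial_name", "emotion_label", "collected_at",
]

# Assumption: video IDs are 11 characters drawn from [A-Za-z0-9_-].
VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row passed."""
    problems = []
    if not VIDEO_ID_RE.match(str(row.get("youtube_video_id", ""))):
        problems.append("youtube_video_id does not look like a video ID")
    try:
        start = float(row["start_sec"])
        end = float(row["end_sec"])
        duration = float(row["duration_sec"])
    except (KeyError, TypeError, ValueError):
        problems.append("start_sec, end_sec, duration_sec must be numbers")
        return problems
    if start < 0 or end <= start:
        problems.append("need 0 <= start_sec < end_sec")
    elif abs((end - start) - duration) > 1e-6:
        problems.append("duration_sec does not equal end_sec - start_sec")
    return problems


def write_manifest(rows: list[dict], out) -> None:
    """Write validated rows as CSV to a text file object; raise on the first bad row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        problems = validate_row(row)
        if problems:
            raise ValueError(f"{row.get('clip_id')}: {'; '.join(problems)}")
        writer.writerow(row)
```

To produce a file, open it with `open("manifest.csv", "w", newline="", encoding="utf-8")` and pass the handle as `out`. Failing loudly on a bad row is a deliberate choice here: a wrong timestamp in a pointer dataset silently points users at the wrong footage.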
2) In your dataset card, say explicitly
- You do not redistribute video/audio; you release IDs/URLs + timestamps + annotations.
- Videos may disappear over time (link rot), and you version the dataset accordingly (Kinetics explicitly notes this ecosystem reality). (arXiv)
- Provide a simple takedown/removal contact (even for pointer datasets, some creators may object).
3) If you want controlled access, gate only the annotations/features (optional)
Use gated datasets if you need:
- tracking access for ethics reasons, or
- requiring users to agree to conditions (e.g., “no surveillance use”)
Mechanically, HF supports gated datasets + custom fields. (Hugging Face)
But keep the dataset media-free.
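For the gating itself, the request form is configured in the metadata block at the top of the dataset card (README.md). Below is a small sketch that renders such a block from Python. The keys `extra_gated_prompt` and `extra_gated_fields` and the field types `text` and `checkbox` are what I recall from the Hub's gated dataset documentation, so please verify them against the current docs. The function name and the example wording are hypothetical.

```python
import json


def gated_card_front_matter(prompt: str, fields: dict[str, str]) -> str:
    """Render a YAML metadata block for a gated dataset card.

    `fields` maps a form label to a field type (assumed: "text", "checkbox").
    json.dumps is used for quoting, since JSON strings are also valid
    YAML double-quoted scalars.
    """
    lines = [
        "---",
        f"extra_gated_prompt: {json.dumps(prompt)}",
        "extra_gated_fields:",
    ]
    for label, field_type in fields.items():
        lines.append(f"  {json.dumps(label)}: {field_type}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Example wording only; adapt the prompt and fields to your own conditions.
EXAMPLE_FRONT_MATTER = gated_card_front_matter(
    "This dataset contains only YouTube IDs, timestamps, and annotations.",
    {
        "Affiliation": "text",
        "I agree not to use this dataset for surveillance": "checkbox",
    },
)
```

Paste the resulting block at the very top of README.md, then enable access requests in the dataset settings. The form only records who asked and what they agreed to; it does not change what you are allowed to host.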
4) Add a DOI for paper citation
Generate a DOI on the HF repo so the paper can cite a stable identifier. (Hugging Face)
When could you publish the clips themselves?
Only when you can show a clear redistribution basis, e.g.:
- you filmed/own the videos; or
- you have written permission; or
- the videos are genuinely under a license permitting reuse (YouTube’s CC BY option exists, but you still need confidence the uploader had the rights to apply it). (Google Help)
One-page decision rule
- Want a public link in your paper? Publish manifest + labels, not clips. (YouTube)
- Want a request form? Use HF gated datasets for annotations/features; it’s access control, not permission. (Hugging Face)
- Want to host the video clips? Assume “no” unless you can document redistribution rights.