How to edit dataset to train AI?

I had a high-resolution portrait photo taken and I would like to upload it for public use and, more specifically, make it free for AI systems to train on. I’ve uploaded the .tiff photo and made a dataset, but I’m not sure where to go next.

It’s a bit long to put into words, but basically:

  • To be safe, strip any metadata embedded in the image files.
  • Convert the image files into lightweight formats suitable for training, such as JPEG, and use those as the main part of the dataset. (You could keep the TIFF files for those who want high-quality images.) It’s faster to do this in Python… and it’s easy to have a generative AI generate the actual code.
  • Create a CSV or JSON file to be used as metadata (labels) during training. Any text file format that’s easy for you to edit will work. (Typically, users decide for themselves which fields to use and how, so it’s fine to have many metadata fields. The accuracy and precision of the labels directly impact the model’s performance. This is the “main dataset creation phase” that comes after gathering the image collection.)
  • Write the license and README.md
  • Other minor details

The process should generally follow these steps.


What to do next with your Hugging Face portrait dataset

You already did the hardest first step: you made a public Hugging Face dataset and selected CC0 1.0, which is the right license direction if your goal is broad reuse, including AI/ML training and commercial reuse. The next step is not to train a model; it is to make the dataset clear, loadable, legally understandable, and useful to people or systems that may include it in training corpora.

Right now, your dataset page shows CC0 1.0, 2 rows, a total file size of about 350 MB, an empty README, and a Dataset Viewer failure because it tried to scan 350,119,395 bytes against a 300,000,000-byte limit. That means the problem is not the idea; the problem is packaging. (Hugging Face)
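
The size arithmetic behind that viewer error can be checked locally before uploading. The following is a minimal sketch, not an official Hub check: the 300,000,000-byte figure is copied from the error message quoted above and may change, and `total_bytes` and `exceeds_viewer_limit` are hypothetical helper names.

```python
from pathlib import Path

# Limit quoted in the Dataset Viewer error above. Treat it as an assumption,
# since the Hub may change it.
VIEWER_SCAN_LIMIT_BYTES = 300_000_000


def total_bytes(folder: Path) -> int:
    """Sum the sizes of all regular files under `folder`, recursively."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def exceeds_viewer_limit(size_in_bytes: int, limit: int = VIEWER_SCAN_LIMIT_BYTES) -> bool:
    """Return True when the scanned size is larger than the limit."""
    return size_in_bytes > limit


# The figure reported on the dataset page is over the limit by about 50 MB.
reported = 350_119_395
print(exceeds_viewer_limit(reported), reported - VIEWER_SCAN_LIMIT_BYTES)
```

Running `total_bytes(Path("train"))` after restructuring gives a rough idea of whether the default data will fit.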


1. Reframe the dataset

Use this framing:

A consented, CC0, high-resolution portrait photograph released for unrestricted public reuse, including AI/ML training, evaluation, research, commercial use, redistribution, modification, and inclusion in larger datasets.

Avoid framing it as:

A TIFF file for AI to train on.

That is too narrow. AI systems do not automatically train on every public upload. People and dataset builders are more likely to use it if the repo is easy to understand, easy to load, and legally clear.


2. Separate the archival image from the training image

Your TIFF is valuable as a source/master file, but it should not be the default training row. An archival image in the 302 MB range is heavy for previews and most training pipelines. Your current viewer error is evidence that the default data is too large for convenient Hub preview. Hugging Face documents TooBigContentError as a Dataset Viewer limit issue and recommends avoiding very large first-row content or moving large payloads to separate files when possible. (Hugging Face)

Use this target structure:

README.md

train/
  ian_portrait_001.jpg
  metadata.csv

original/
  ian_portrait_001.tif

What this means:

| Path | Role |
|---|---|
| train/ian_portrait_001.jpg | Normal dataset image; easy to preview/load/train from |
| train/metadata.csv | Caption, license, consent, and source metadata |
| original/ian_portrait_001.tif | Full-resolution archival source |

This turns the repo from “two versions of the same image as two rows” into “one usable dataset row plus one archival source file.”
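
The target structure can be verified on a local checkout before pushing. This is a small sketch under the assumption that the repository has been cloned locally; `check_layout` and its constants are hypothetical names, and the file names are the ones used in the layout above.

```python
from pathlib import Path

# Paths from the target structure above, relative to the repository root.
EXPECTED_PATHS = (
    "README.md",
    "train/ian_portrait_001.jpg",
    "train/metadata.csv",
    "original/ian_portrait_001.tif",
)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def check_layout(repo_root: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means the layout matches."""
    problems = []
    for rel in EXPECTED_PATHS:
        if not (repo_root / rel).is_file():
            problems.append(f"missing: {rel}")
    # Images left at the repository root are the kind of file that showed up as extra rows.
    for p in sorted(repo_root.iterdir()):
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES:
            problems.append(f"root-level image: {p.name}")
    return problems
```

For example, `check_layout(Path("."))` run from the repository root should return an empty list once the restructuring is done.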


3. Create a smaller training-friendly image

Make a high-quality JPEG or PNG derivative from the TIFF.

Suggested default:

  • Format: JPEG
  • Long side: 2048 px or 1024 px
  • Color mode: RGB
  • Quality: high, but not huge
  • Metadata: stripped unless intentionally retained
  • Purpose: default loadable image

Example:

# pip install pillow

from pathlib import Path
from PIL import Image, ImageOps

source_path = Path("Ian-1.tif")
output_dir = Path("train")
output_dir.mkdir(exist_ok=True)

with Image.open(source_path) as src:
    # Apply the EXIF orientation to the pixels, then convert to RGB for JPEG output.
    img = ImageOps.exif_transpose(src).convert("RGB")

# Keep quality high while making the file practical for preview/training.
# thumbnail() resizes in place, preserves the aspect ratio, and never upscales.
max_side = 2048
img.thumbnail((max_side, max_side), Image.LANCZOS)

img.save(output_dir / "ian_portrait_001.jpg", quality=95, optimize=True)

This gives users a practical default image while preserving the TIFF for people who need the full-resolution source.
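
To see what output size a given long-side limit produces before running the conversion, the aspect-ratio arithmetic can be sketched in plain Python. The 6000 x 4000 source size below is a made-up example, not the actual size of the TIFF, and `fit_within` is a hypothetical helper; Pillow's own rounding may differ by a pixel.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Multiply before dividing so exact cases stay exact.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


# Hypothetical 6000 x 4000 source with the two suggested long-side limits.
print(fit_within(6000, 4000, 2048))  # (2048, 1365)
print(fit_within(6000, 4000, 1024))  # (1024, 683)
```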


4. Check and strip hidden metadata

High-resolution portraits may include EXIF/IPTC metadata such as camera model, lens, timestamp, editing software, creator fields, contact info, or GPS coordinates. Use ExifTool to inspect metadata; it is a standard tool for reading/writing image metadata. (Creative Commons)

Inspect:

exiftool Ian-1.tif
exiftool train/ian_portrait_001.jpg

Strip the public training derivative:

exiftool -all= -overwrite_original train/ian_portrait_001.jpg

Recommended approach:

| File | Recommendation |
|---|---|
| Private original TIFF | Keep untouched locally |
| Public training JPG | Strip metadata |
| Public archival TIFF | Inspect; remove private/GPS/contact metadata if present |
| Extra PNG derivative | Strip metadata if kept |
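
As a cross-check after stripping, the JPEG derivative can be scanned for leftover metadata segments with the standard library alone. This is a rough sketch, not a replacement for ExifTool: it only understands JPEG, it only looks at segments before the image scan data, and `list_metadata_segments` is a hypothetical helper name.

```python
# Payload prefixes for common metadata-bearing JPEG application segments.
_SIGNATURES = {
    0xE1: [(b"Exif\x00\x00", "EXIF"), (b"http://ns.adobe.com/xap/1.0/\x00", "XMP")],
    0xE2: [(b"ICC_PROFILE\x00", "ICC profile")],
    0xED: [(b"Photoshop 3.0\x00", "Photoshop/IPTC")],
}


def list_metadata_segments(data: bytes) -> list[str]:
    """Return labels for metadata-like segments found before the image scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    found = []
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xFF:  # padding byte before a marker
            pos += 1
            continue
        if marker == 0xDA:  # start of scan: compressed image data follows
            break
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        if length < 2:
            break  # malformed segment; stop rather than guess
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xFE:
            found.append("comment")
        elif 0xE1 <= marker <= 0xEF:  # APP1..APP15; APP0 (JFIF) is skipped
            label = f"APP{marker - 0xE0}"
            for prefix, name in _SIGNATURES.get(marker, []):
                if payload.startswith(prefix):
                    label = name
            found.append(label)
        pos += 2 + length
    return found
```

Calling `list_metadata_segments(Path("train/ian_portrait_001.jpg").read_bytes())` (with `pathlib.Path` imported) should ideally return no `EXIF`, `XMP`, or `Photoshop/IPTC` entries for the public derivative. Segments it does not recognize are reported by number so they can be inspected manually.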

5. Add metadata.csv

Create:

train/metadata.csv

Use this:

file_name,text,subject_type,source_format,archival_file,license,ai_training_permission,depicted_person_consent,rights_holder_release,no_endorsement
ian_portrait_001.jpg,"Studio portrait photograph of a young Black man, neutral expression, direct gaze, plain background.",human portrait,TIFF,original/ian_portrait_001.tif,cc0-1.0,yes,yes,yes,"Reuse does not imply endorsement by the depicted person."

Why this works:

| Column | Purpose |
|---|---|
| file_name | Connects the row to the image file |
| text | Caption for image-captioning/search/training workflows |
| subject_type | States that this is a human portrait |
| source_format | Explains the original file format |
| archival_file | Points to the TIFF without making it a training row |
| license | Makes the license explicit at row level |
| ai_training_permission | States your intent clearly |
| depicted_person_consent | Important for a recognizable person |
| rights_holder_release | Important if a photographer/copyright holder is involved |
| no_endorsement | Clarifies that reuse does not imply endorsement |

Hugging Face’s image dataset guide supports this no-code pattern: image files plus metadata.csv, metadata.jsonl, or metadata.parquet, with file_name linking metadata to images. (Hugging Face)

Avoid naming the TIFF pointer original_file_name or archival_file_name. Since Hugging Face treats file_name / *_file_name fields as media references, use archival_file instead.
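
If you would rather generate the file than type it by hand, Python's `csv` module takes care of quoting captions that contain commas. The sketch below reuses the columns and values from the example row above; `write_metadata` and `risky_columns` are hypothetical helpers, and the column-name guard simply encodes the naming rule described in the previous paragraph.

```python
import csv
from pathlib import Path

FIELDS = [
    "file_name", "text", "subject_type", "source_format", "archival_file",
    "license", "ai_training_permission", "depicted_person_consent",
    "rights_holder_release", "no_endorsement",
]

ROW = {
    "file_name": "ian_portrait_001.jpg",
    "text": "Studio portrait photograph of a young Black man, neutral expression, direct gaze, plain background.",
    "subject_type": "human portrait",
    "source_format": "TIFF",
    "archival_file": "original/ian_portrait_001.tif",
    "license": "cc0-1.0",
    "ai_training_permission": "yes",
    "depicted_person_consent": "yes",
    "rights_holder_release": "yes",
    "no_endorsement": "Reuse does not imply endorsement by the depicted person.",
}


def risky_columns(header: list[str]) -> list[str]:
    """Columns other than file_name whose names end in _file_name."""
    return [c for c in header if c != "file_name" and c.endswith("_file_name")]


def write_metadata(path: Path, rows: list[dict]) -> None:
    """Write rows to a CSV file with the FIELDS header, quoting where needed."""
    bad = risky_columns(FIELDS)
    if bad:
        raise ValueError(f"rename these columns: {bad}")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Usage would be `write_metadata(Path("train/metadata.csv"), [ROW])`.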


6. Replace the empty README with a real dataset card

On Hugging Face, README.md is the dataset card. Dataset-card metadata helps with license display, tags, discoverability, size, language, and data-files configuration. (Hugging Face)

Use this polished README:

---
license: cc0-1.0
pretty_name: "CC0 High-Resolution Portrait Photograph of a Young Black Man"
language:
- en
tags:
- image
- portrait
- photography
- human
- cc0
- public-domain
- ai-training
- computer-vision
- image-captioning
task_categories:
- image-to-text
- text-to-image
size_categories:
- n<1K
configs:
- config_name: default
  data_dir: train
  default: true
---

# CC0 High-Resolution Portrait Photograph of a Young Black Man

## Dataset Summary

This dataset contains a consented, high-resolution portrait photograph intentionally released for unrestricted public reuse, including AI/ML training, fine-tuning, evaluation, research, education, commercial use, redistribution, modification, and inclusion in larger datasets.

The default dataset contains a training-friendly image derivative in `train/`. The full-resolution TIFF source file is preserved separately in `original/`.

## Dataset Contents

| Path | Purpose |
|---|---|
| `train/ian_portrait_001.jpg` | Training-friendly public image |
| `train/metadata.csv` | Caption, license, consent, and source metadata |
| `original/ian_portrait_001.tif` | Archival full-resolution source image |

## Data Fields

The default dataset contains:

- `image`: the training-friendly portrait image
- `text`: factual image description
- `subject_type`: broad subject category
- `source_format`: original source format
- `archival_file`: path to the full-resolution source file
- `license`: row-level license identifier
- `ai_training_permission`: explicit AI/ML training permission
- `depicted_person_consent`: consent flag
- `rights_holder_release`: rights-holder release flag
- `no_endorsement`: no-endorsement statement

## Intended Use

This dataset may be used for:

- AI/ML training
- image-captioning examples
- text-to-image dataset experiments
- computer-vision testing
- public-domain image reuse
- research and education
- commercial and non-commercial projects
- inclusion in larger datasets

## AI/ML Training Permission

This image is intentionally released for AI/ML training, fine-tuning, evaluation, research, commercial use, redistribution, modification, and inclusion in larger datasets under CC0 1.0.

## Consent and Rights

The depicted person has consented to public release of this image for unrestricted reuse, including AI/ML training and commercial use.

The uploader represents that they have the rights necessary to release the image and its derivatives under CC0 1.0.

Reuse of this image does not imply endorsement by the depicted person, uploader, photographer, or any other contributor.

## License

This dataset is released under CC0 1.0.

No attribution is required, though it is appreciated.

## Limitations

This is a single-image dataset. It is useful as a public portrait sample, image-captioning example, dataset-loading example, test image, or one image in a larger corpus.

It is not large enough by itself to train a robust identity model, face-recognition model, or general image-generation model.

This dataset contains one person and should not be treated as representative of any demographic group.

## Ethical Considerations

This dataset contains a recognizable human portrait. Users should consider privacy, publicity, likeness, and endorsement issues in downstream uses, even when the image is openly licensed.

Do not imply that the depicted person endorses a downstream model, product, dataset, output, or use case unless separate permission has been granted.

## How to Load

```python
from datasets import load_dataset

ds = load_dataset("jericho98/tiff-photograph-of-young-black-man", split="train")
print(ds)
print(ds[0])
```

## Citation

No citation is required under CC0. If you cite the dataset anyway, cite the Hugging Face dataset page.

The configs block matters because it tells the Hub that train/ is the default dataset directory, rather than treating every image-like file as a normal row. Hugging Face supports YAML configuration for dataset cards and data files. (Hugging Face)
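
Before committing, a quick local check can confirm that the README actually starts with a front matter block and that the keys you rely on are present. This is a deliberately naive sketch using only the standard library: it is not a YAML parser, it only looks at top-level keys, and `front_matter_lines`, `top_level_keys`, and `missing_keys` are hypothetical helper names.

```python
def front_matter_lines(readme_text: str) -> list[str]:
    """Return the lines between the opening and closing '---' delimiters, if any."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return lines[1:i]
    return []  # no closing delimiter found


def top_level_keys(readme_text: str) -> set[str]:
    """Collect keys that start at column 0 (list items, comments, nested keys are skipped)."""
    keys = set()
    for line in front_matter_lines(readme_text):
        if not line or line[0].isspace() or line.startswith(("-", "#")):
            continue
        if ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


def missing_keys(readme_text: str, required=("license", "configs")) -> list[str]:
    """Return the required keys that do not appear at the top level."""
    return sorted(set(required) - top_level_keys(readme_text))
```

For instance, `missing_keys(Path("README.md").read_text(encoding="utf-8"))` (with `pathlib.Path` imported) should return an empty list for the card above.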


7. Be precise about CC0, consent, and endorsement

CC0 is a good fit for your goal because it allows copying, modification, distribution, and commercial use without asking permission, to the extent allowed by law. (Creative Commons)

But a portrait has an extra layer: likeness, privacy, publicity, and endorsement. Creative Commons notes that CC0 does not remove rights others may have around image, likeness, privacy, or publicity. (Creative Commons)

So your dataset card should say both:

This dataset is released under CC0 1.0.

and:

The depicted person has consented to public release for unrestricted reuse, including AI/ML training and commercial use.

If you are the depicted person, use:

The depicted person is the uploader and has intentionally released this image for unrestricted public use, including AI/ML training and commercial reuse.

If a photographer took the photo, add:

The photographer/copyright holder has granted permission to release this image and its derivatives under CC0 1.0.

8. Upload/edit workflow

Easiest path: Hugging Face web UI

  1. Open your dataset.

  2. Go to Files and versions.

  3. Upload:

    • train/ian_portrait_001.jpg
    • train/metadata.csv
    • original/ian_portrait_001.tif
  4. Replace README.md with the dataset card above.

  5. Remove or move the old root-level image files.

  6. Commit with:

Restructure dataset with metadata and archival source

Hugging Face supports uploading datasets through the Hub UI, including common formats such as images, CSV, JSONL, and Parquet. (Hugging Face)

Git path

git lfs install

git clone https://huggingface.co/datasets/jericho98/tiff-photograph-of-young-black-man
cd tiff-photograph-of-young-black-man

mkdir -p train original

git mv Ian-1.tif original/ian_portrait_001.tif
cp /path/to/ian_portrait_001.jpg train/ian_portrait_001.jpg

# Create train/metadata.csv and replace README.md before this step.
git add README.md train/metadata.csv train/ian_portrait_001.jpg original/ian_portrait_001.tif
git commit -m "Restructure dataset with metadata and archival source"
git push

If you do not want to keep the old PNG:

git rm Ian-1.png
git commit -m "Remove duplicate root-level image derivative"
git push
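
Python path (optional)

If you prefer not to use Git, the same restructuring can be sketched with the `huggingface_hub` library. This assumes the library is installed, that you are already authenticated with write access, and that the local folder mirrors the target structure; `plan_uploads` and `push` are hypothetical helper names. The upload step builds a single commit from `CommitOperationAdd` and `CommitOperationDelete` operations.

```python
from pathlib import Path

REPO_ID = "jericho98/tiff-photograph-of-young-black-man"
UPLOAD_DIRS = ("train", "original")


def plan_uploads(local_root: Path) -> list[tuple[Path, str]]:
    """Pair each local file with its destination path inside the repo."""
    plan = []
    readme = local_root / "README.md"
    if readme.is_file():
        plan.append((readme, "README.md"))
    for folder in UPLOAD_DIRS:
        folder_path = local_root / folder
        if not folder_path.is_dir():
            continue
        for p in sorted(folder_path.rglob("*")):
            if p.is_file():
                plan.append((p, p.relative_to(local_root).as_posix()))
    return plan


def push(plan, delete_paths=(), repo_id: str = REPO_ID) -> None:
    """Upload the planned files and delete old paths in one commit."""
    # Imported lazily so the planning step works without the library installed.
    from huggingface_hub import CommitOperationAdd, CommitOperationDelete, HfApi

    operations = [
        CommitOperationAdd(path_in_repo=dest, path_or_fileobj=str(src))
        for src, dest in plan
    ]
    operations += [CommitOperationDelete(path_in_repo=p) for p in delete_paths]
    HfApi().create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=operations,
        commit_message="Restructure dataset with metadata and archival source",
    )
```

Review the output of `plan_uploads(Path("."))` first; then something like `push(plan, delete_paths=["Ian-1.tif", "Ian-1.png"])` would add the new files and remove the old root-level ones, assuming those are the current root-level file names.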

9. Test the cleaned dataset

Install:

pip install -U datasets pillow

Test:

from datasets import load_dataset

repo_id = "jericho98/tiff-photograph-of-young-black-man"

ds = load_dataset(repo_id, split="train")

print(ds)
print(ds.features)
print(ds[0])

image = ds[0]["image"]
print(type(image), image.size, image.mode)
image.save("loaded_test.jpg")

Expected result:

Dataset({
    features: ['image', 'text', 'subject_type', ...],
    num_rows: 1
})

You want:

  • one default row,
  • working image preview,
  • caption visible,
  • license/consent fields visible,
  • TIFF preserved but not loaded as the default row,
  • no custom dataset builder script.
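
The first few expectations can be turned into a small automated check. This is a sketch under the assumption that the metadata columns keep the names used in metadata.csv above; `check_loaded` is a hypothetical helper, and it works on plain values so it can be tried without downloading anything.

```python
REQUIRED_COLUMNS = {
    "image",
    "text",
    "license",
    "ai_training_permission",
    "depicted_person_consent",
}


def check_loaded(column_names: list[str], num_rows: int, expected_rows: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the loaded split looks as intended."""
    problems = []
    missing = sorted(REQUIRED_COLUMNS - set(column_names))
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if num_rows != expected_rows:
        problems.append(f"expected {expected_rows} row(s), found {num_rows}")
    return problems


# Roughly the state before cleanup: two rows and no metadata columns.
print(check_loaded(["image"], 2))
```

With the dataset loaded as in the test script above, the call would be `check_loaded(ds.column_names, ds.num_rows)`.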

10. Do not use a dataset builder script

For your case, a builder script is unnecessary. Hugging Face’s ImageFolder exists so image datasets can be loaded without writing custom dataset code. (Hugging Face)

Use:

train/
  ian_portrait_001.jpg
  metadata.csv

Avoid:

dataset.py
builder.py
tiff_photograph_of_young_black_man.py

A simple file-based dataset is more durable, easier for beginners, and easier for the Hub to preview.


11. Be honest about what one photo can do

One image is useful as:

  • a public-domain portrait asset,
  • a dataset-loading example,
  • an image-captioning example,
  • a computer-vision test image,
  • an image-processing source,
  • one item in a larger training corpus,
  • a clean consent/licensing example.

One image is not enough by itself for:

  • robust face recognition,
  • general portrait generation,
  • identity modeling,
  • demographic evaluation,
  • a balanced dataset,
  • strong DreamBooth/LoRA subject personalization.

So do not oversell it. Say:

This is a single-image dataset and should not be treated as representative of any demographic group.

That is accurate and responsible.


12. Optional later: add more photos

If you later want to make this more useful for subject-personalization training, add several real photos with variation:

  • front-facing portrait
  • three-quarter angle
  • side angle
  • different expression
  • different lighting
  • different background
  • close-up
  • mid-shot
  • possibly full-body

Future structure:

README.md

train/
  ian_portrait_001.jpg
  ian_portrait_002.jpg
  ian_portrait_003.jpg
  metadata.csv

original/
  ian_portrait_001.tif
  ian_portrait_002.tif
  ian_portrait_003.tif

Example metadata:

file_name,text,view,expression,lighting,background,license,ai_training_permission,depicted_person_consent,archival_file
ian_portrait_001.jpg,"Studio portrait photograph of a young Black man facing the camera, neutral expression.",front,neutral,studio,plain,cc0-1.0,yes,yes,original/ian_portrait_001.tif
ian_portrait_002.jpg,"Studio portrait photograph of a young Black man at a three-quarter angle, slight smile.",three-quarter,slight smile,studio,plain,cc0-1.0,yes,yes,original/ian_portrait_002.tif

Do not add crops or filters as if they were independent originals. If you add derivatives, label them as derivatives.
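
Once there are several images, it becomes easy for metadata.csv and the image files to drift apart. A minimal consistency check, assuming the multi-image layout shown above and using only the standard library (`compare_metadata_to_files` is a hypothetical helper name), could look like this:

```python
import csv
from pathlib import Path

TRAIN_IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def compare_metadata_to_files(train_dir: Path) -> tuple[list[str], list[str]]:
    """Return (rows without an image file, image files without a row)."""
    with (train_dir / "metadata.csv").open(newline="", encoding="utf-8") as f:
        listed = {row["file_name"] for row in csv.DictReader(f)}
    present = {
        p.name
        for p in train_dir.iterdir()
        if p.is_file() and p.suffix.lower() in TRAIN_IMAGE_SUFFIXES
    }
    return sorted(listed - present), sorted(present - listed)
```

Both returned lists should be empty before you push; for example, `compare_metadata_to_files(Path("train"))`.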


13. Optional later: DOI

A DOI is useful if you want the dataset to be cited formally. Hugging Face supports DOIs for datasets and models, but DOI-linked objects are meant to be persistent, so cleanup should come first. (Hugging Face)

Wait until:

  • README is complete,
  • file structure is stable,
  • metadata is correct,
  • Dataset Viewer works,
  • load_dataset() works,
  • you are confident you will not restructure again.

Final checklist

Do first

  • Make train/ian_portrait_001.jpg.
  • Move TIFF to original/ian_portrait_001.tif.
  • Add train/metadata.csv.
  • Replace empty README with a real dataset card.
  • Remove root-level duplicate image files.
  • Test with load_dataset().

Key best practices

  • Keep TIFF as archival source, not default row.
  • Use a smaller JPG as the normal training image.
  • Add caption, consent, license, and AI-training permission.
  • Be explicit that commercial use and AI training are allowed.
  • Say reuse does not imply endorsement.
  • Do not use a custom dataset builder script.
  • Do not claim one image is a complete training corpus.

Short summary

  • Your dataset idea is good.
  • The current repo needs cleanup: large default files, two near-duplicate rows, an empty README, and a viewer failure.
  • The best version is simple: README.md, train/metadata.csv, one training-friendly JPG, and the full TIFF in original/.
  • The dataset’s value is rights clarity + consent + CC0 + easy loading, not size.