Image classification

I need to train my model with tiles that come from 300 whole slide images. When the dataset is loaded using load_dataset, I lose the filename, so I can't see where the tiles come from. When I use train_test_split, it's very important that all tiles that come from the same image end up in the same split. How can I keep the filename so that I can make sure this works correctly?

You'll be able to fetch the filenames once the issue "Return the name of the currently loaded file in the load_dataset function" (Issue #5806, huggingface/datasets on GitHub) is addressed.

In the meantime, this should work:

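import os
from datasets import load_dataset
ds = load_dataset(...)
...
# Store the base name of each tile's source file (None if the image has no path)
ds = ds.map(lambda ex: {"filename": os.path.basename(ex["image"].filename) if ex["image"].filename else None})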
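
Once a "filename" column exists, a group-aware split can be built from it instead of calling train_test_split directly, which shuffles individual rows. The following is only a minimal sketch: the helper names `slide_id_from_filename` and `grouped_split_indices` are hypothetical, and the naming scheme "<slide_id>_<tile_index>.png" is an assumption, so adapt the parsing to however your tiles are actually named.

```python
import random
from collections import defaultdict


def slide_id_from_filename(filename):
    # ASSUMPTION: tiles are named "<slide_id>_<tile_index>.png".
    # Adjust this parsing to match your real file names.
    return filename.rsplit("_", 1)[0]


def grouped_split_indices(groups, test_size=0.2, seed=0):
    """Split row indices so each group lands entirely in train or in test.

    Note: test_size is a fraction of the groups (slides), not of the tiles.
    """
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    # Shuffle the unique groups reproducibly, then cut the list once.
    unique_groups = sorted(by_group)
    random.Random(seed).shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_size))

    test_idx = sorted(i for g in unique_groups[:n_test] for i in by_group[g])
    train_idx = sorted(i for g in unique_groups[n_test:] for i in by_group[g])
    return train_idx, test_idx


# Toy data standing in for ds["filename"]
filenames = ["slideA_0.png", "slideA_1.png", "slideB_0.png",
             "slideC_0.png", "slideC_1.png"]
groups = [slide_id_from_filename(f) for f in filenames]
train_idx, test_idx = grouped_split_indices(groups, test_size=0.34, seed=0)
```

With a real (single-split) Dataset, the index lists could then be passed to `ds.select(train_idx)` and `ds.select(test_idx)`. Rows whose filename is None would need to be handled separately before grouping.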