Mapping a multi-element column of a dataset to a multi-row dataset with one element per row, duplicating the other features

Apologies for the spam.

I am currently trying to process an image dataset into a new representation. Essentially, I am converting each image into a set of sub-images, and I would like each sub-image to be a single example in the dataset.
Currently I have a dataset of the form:

Dataset({
    features: ['coordinates', 'filename', 'img', 'label', 'full_label'],
    num_rows: 11969
})

and I have a map function which converts it to:

Dataset({
    features: ['coordinates', 'filename', 'sub_images', 'label', 'full_label'],
    num_rows: 11969
})

where "sub_images" is a list containing n sub-images.

I would like to convert this new dataset to the form:

Dataset({
    features: ['coordinates', 'filename', 'img', 'label', 'full_label'],
    num_rows: 11969*n
})

Where each "sub_images" list "unrolls" into n separate rows, duplicating the corresponding coordinates, filename, label, and full_label. I have attempted this with a batched map using the following function:

import numpy as np

def patches_to_examples(example):
    return {"label": [example["label"] for _ in example["sub_images"]],
            "full_label": [example["full_label"] for _ in example["sub_images"]],
            "filename": [example["filename"] for _ in example["sub_images"]],
            "coordinates": [example["coordinates"] for _ in example["sub_images"]],
            "img": [np.array(image) for image in example["sub_images"]]}

ds = ds.map(patches_to_examples, batched=True, remove_columns=ds.column_names)

However, this only creates one row per example and stacks the images in a list, whereas I would like it to create len(sub_images) rows per example, with one image per row.
Any suggestions on where Iā€™m going wrong?
Cheers in advance!

Hi! This function takes a batch of examples as input, so you have to do two for loops: one to loop over the examples, and one to loop over the sub-images:

def patches_to_examples(batch):
    return {
        # repeat each example's metadata once per sub-image
        "label": [label for i, label in enumerate(batch["label"]) for _ in batch["sub_images"][i]],
        "full_label": [full_label for i, full_label in enumerate(batch["full_label"]) for _ in batch["sub_images"][i]],
        "filename": [filename for i, filename in enumerate(batch["filename"]) for _ in batch["sub_images"][i]],
        "coordinates": [coordinates for i, coordinates in enumerate(batch["coordinates"]) for _ in batch["sub_images"][i]],
        # flatten the nested list of sub-images into one image per row
        "img": [np.array(sub_image) for sub_images in batch["sub_images"] for sub_image in sub_images],
    }
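
For reference, applying this the same way with map should then yield one row per sub-image. Here is a minimal sketch with a hypothetical toy dataset (the feature values are made up purely for illustration):

import numpy as np
from datasets import Dataset

# Hypothetical toy dataset: 2 examples with 3 and 2 sub-images respectively
toy = Dataset.from_dict({
    "coordinates": [[0, 0], [10, 10]],
    "filename": ["a.png", "b.png"],
    "label": [0, 1],
    "full_label": ["class_a", "class_b"],
    "sub_images": [
        [np.zeros((4, 4)).tolist()] * 3,
        [np.ones((4, 4)).tolist()] * 2,
    ],
})

flat = toy.map(patches_to_examples, batched=True, remove_columns=toy.column_names)
print(flat.num_rows)  # 5: one row per sub-image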

Alternatively, you can try using the pandas explode function:

ds.with_format("pandas").map(lambda df: df.explode("sub_images"), batched=True)

(you might need to do df.explode("sub_images").dropna(), otherwise empty sub_images lists might create some NaNs)
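
Putting the pieces together, here is a sketch of the full pandas route; the dropna, rename_column, and with_format(None) steps are my additions for completeness (rename_column and with_format are standard Dataset methods):

# Explode the nested column, drop rows produced by empty lists,
# rename the column back to "img", and reset to the default format
ds_flat = (
    ds.with_format("pandas")
      .map(lambda df: df.explode("sub_images").dropna(subset=["sub_images"]), batched=True)
      .rename_column("sub_images", "img")
      .with_format(None)
)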


Thanks very much!
Problem solved

Hello @lhoestq, thinking about performance in terms of memory and processing: how would you compare converting the DataFrame to a Dataset and then applying your solution, versus applying explode beforehand and then converting the DataFrame to a Dataset?

Thank you!

Pandas is not memory efficient: if your dataset is too big, you'll run out of memory.

Unlike pandas, datasets uses memory mapping (so you don't fill your RAM when loading the data) and processes the data with map in batches (so you have at most batch_size rows in RAM at a time).

Using datasets you can load and process big datasets without running out of memory.

Though it might be a bit slower, because it reads the data from your disk.
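
For example, you can control how many rows are held in memory at once via the batch_size parameter of map (its default is 1000):

# Only ~batch_size rows are materialized in RAM at any one time;
# the rest of the dataset stays memory-mapped on disk
ds = ds.map(patches_to_examples, batched=True, batch_size=1000,
            remove_columns=ds.column_names)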

Btw, to use memory mapping you need to specify where to save the dataset on disk. You can do:

from datasets import Dataset, load_from_disk

# Write the in-memory dataset to disk, then reload it memory-mapped
ds = Dataset.from_pandas(df)
ds.save_to_disk("path/to/dir")
ds = load_from_disk("path/to/dir")
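
As a quick sanity check, you can inspect which on-disk Arrow files back the dataset (an empty list means the data lives in memory):

# After load_from_disk, this lists the Arrow file(s) under path/to/dir
print(ds.cache_files)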

Though there are discussions about memory mapping a dataset by default when doing from_pandas/from_dict/from_list (i.e. when the data comes from memory). Let me know what you think!

Thanks for the explanation, very informative.

I was not aware that the from_pandas/from_dict/from_list functions don't manage memory mapping. I had a use case where the dataset was initialized using from_pandas, and I didn't realize this because I was testing on very small data. It would also be really helpful if memory mapping still applied when the data comes from memory!

So, comparing performance: if memory mapping is applied, using datasets will be more advantageous in this case.