Use existing Dataset with a generator

Hello,

Is there a way to modify an existing Dataset object so that it yields multiple examples per single example?

For instance, let’s say an example has 3 fields: “image”, “text1”, “text2”. From this I would like to yield 2 examples with 2 fields each: (i) “image” and “text” (taken from “text1”), and (ii) “image” and “text” (taken from “text2”).

Essentially, I would like to reuse an image with all of its corresponding text fields without literally duplicating examples (and therefore images) in the dataset.
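
To make the shape concrete, here is a toy illustration (made-up values):

# one stored example (what the dataset actually contains)
stored_example = {"image": "cat.png", "text1": "a cat", "text2": "a sleeping cat"}

# what I would like training to iterate over, without storing the image twice
wanted_examples = [
    {"image": "cat.png", "text": "a cat"},
    {"image": "cat.png", "text": "a sleeping cat"},
]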

Thanks.

From reading the documentation, my understanding is that there is only one option:

  1. Use the map function with an IterableDataset (hence lazily evaluated) instead of the Dataset set_transform function (since the latter requires the same number of examples to be returned).

When I try this, I get NotImplementedError: Sharding a VerticallyConcatenatedMultiSourcesExamplesIterable is not implemented
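
For reference, here is a simplified sketch of what I’m trying (toy data stands in for my actual dataset; the error itself only appears with my real setup once sharding kicks in, e.g. a DataLoader with several workers):

from datasets import Dataset

# toy stand-in for the real dataset
ds = Dataset.from_dict({
    "image": ["img0.png", "img1.png"],
    "text1": ["a", "b"],
    "text2": ["c", "d"],
})

def one_to_two(batch):
    # emit two ("image", "text") rows per input row
    out = {"image": [], "text": []}
    for img, t1, t2 in zip(batch["image"], batch["text1"], batch["text2"]):
        out["image"] += [img, img]
        out["text"] += [t1, t2]
    return out

# lazy, on-the-fly version: convert to an IterableDataset and map it
ids = ds.to_iterable_dataset(num_shards=2)
ids = ids.map(one_to_two, batched=True, remove_columns=["text1", "text2"])

for example in ids:
    print(example)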

Hi!

You can split the examples using a batched map as follows:

def split_examples(batch):
    # Build a new batch with two output rows per input row:
    # one pairing the image with "text1" and one pairing it with "text2".
    split_examples_batch = {"image": [], "text": []}
    for i, img in enumerate(batch["image"]):
        split_examples_batch["image"].append(img)
        split_examples_batch["text"].append(batch["text1"][i])
        split_examples_batch["image"].append(img)
        split_examples_batch["text"].append(batch["text2"][i])
    return split_examples_batch

ds = ds.map(split_examples, batched=True, remove_columns=["text1", "text2"])

Regarding the NotImplementedError: this limitation will be removed in the next release of datasets (there is an open PR working on this).

Hi, thanks for the reply!

I forgot to mention that I was trying to avoid mapping the entire dataset at once because of its large size (170+ GB). That’s what I initially did, but the map takes 2-3 hours at the start of EVERY training run (and inflates everything to 700+ GB), which is infeasible for me.

The only on-the-fly, one-to-many-examples option is the one mentioned above, right? I was hoping there’d be something else :grin:.

I forgot to mention that I was trying to avoid mapping the entire dataset at once because of its large size (170+ GB). That’s what I initially did, but the map takes 2-3 hours at the start of EVERY training run (and inflates everything to 700+ GB), which is infeasible for me.

You can use multiprocessing (the num_proc parameter) and disable image decoding with ds = ds.cast_column("image", Image(decode=False)) (turn it back on later with ds = ds.cast_column("image", Image())) to make the processing faster. Also, it’s best to run this transform on the “raw” dataset so that later runs can reload the result from the cache.
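
For example, combining the above with the split_examples function from earlier (the num_proc value is arbitrary; pick whatever fits your machine):

from datasets import Image

# skip image decoding while splitting rows; the image bytes are just copied as-is
ds = ds.cast_column("image", Image(decode=False))

ds = ds.map(
    split_examples,
    batched=True,
    remove_columns=["text1", "text2"],
    num_proc=8,  # example value; set to your number of CPU cores
)

# turn decoding back on so "image" yields PIL images again
ds = ds.cast_column("image", Image())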

The only on-the-fly, one-to-many-examples option is the one mentioned above, right?

Correct!
