Use existing Dataset with a generator

Hello,

Is there a way to modify an existing Dataset object so that it yields multiple examples per single example?

For instance, let’s say an example has 3 fields: “image”, “text1”, “text2”. From this I would like to yield 2 examples with 2 fields each: (i) “image” and “text” (taken from “text1”), and (ii) “image” and “text” (taken from “text2”).

Essentially, I would like to reuse an image with all of its corresponding text fields without literally duplicating examples (and therefore images) in the dataset.
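
To make the shape concrete, here is a toy illustration (made-up values):

# one stored example (what the dataset actually contains)
stored_example = {"image": "cat.png", "text1": "a cat", "text2": "a sleeping cat"}

# what I would like training to iterate over, without storing the image twice
wanted_examples = [
    {"image": "cat.png", "text": "a cat"},
    {"image": "cat.png", "text": "a sleeping cat"},
]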

Thanks.

From reading the documentation, my understanding is that there is only one option:

  1. Use the map function with an IterableDataset (hence lazily evaluated) instead of the Dataset set_transform function (since the latter requires the same number of examples to be returned).

When I try this, I get NotImplementedError: Sharding a VerticallyConcatenatedMultiSourcesExamplesIterable is not implemented
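
For reference, here is a simplified sketch of what I’m trying (toy data stands in for my actual dataset; the error itself only appears with my real setup once sharding kicks in, e.g. a DataLoader with several workers):

from datasets import Dataset

# toy stand-in for the real dataset
ds = Dataset.from_dict({
    "image": ["img0.png", "img1.png"],
    "text1": ["a", "b"],
    "text2": ["c", "d"],
})

def one_to_two(batch):
    # emit two ("image", "text") rows per input row
    out = {"image": [], "text": []}
    for img, t1, t2 in zip(batch["image"], batch["text1"], batch["text2"]):
        out["image"] += [img, img]
        out["text"] += [t1, t2]
    return out

# lazy, on-the-fly version: convert to an IterableDataset and map it
ids = ds.to_iterable_dataset(num_shards=2)
ids = ids.map(one_to_two, batched=True, remove_columns=["text1", "text2"])

for example in ids:
    print(example)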

Hi!

You can split the examples using a batched map as follows:

def split_examples(batch):
    # Build a new batch with two output rows per input row:
    # one pairing the image with "text1" and one pairing it with "text2".
    split_examples_batch = {"image": [], "text": []}
    for i, img in enumerate(batch["image"]):
        split_examples_batch["image"].append(img)
        split_examples_batch["text"].append(batch["text1"][i])
        split_examples_batch["image"].append(img)
        split_examples_batch["text"].append(batch["text2"][i])
    return split_examples_batch

ds = ds.map(split_examples, batched=True, remove_columns=["text1", "text2"])

Regarding the NotImplementedError: this limitation will be removed in the next release of datasets (there is an open PR working on this).

Hi, thanks for the reply!

I forgot to mention that I was trying to avoid mapping the entire dataset at once because of its large size (170+ GB). That’s what I initially did, but the map takes 2-3 hours at the start of EVERY training run (and inflates everything to 700+ GB), which is infeasible for me.

The only on-the-fly, one-to-many-examples option is the one mentioned above, right? I was hoping there’d be something else :grin:.

I forgot to mention that I was trying to avoid mapping the entire dataset at once because of its large size (170+ GB). That’s what I initially did, but the map takes 2-3 hours at the start of EVERY training run (and inflates everything to 700+ GB), which is infeasible for me.

You can use multiprocessing (the num_proc parameter) and disable image decoding with ds = ds.cast_column("image", Image(decode=False)) (turn it back on later with ds = ds.cast_column("image", Image())) to make the processing faster. Also, it’s best to run this transform on the “raw” dataset so that later runs can reload the result from the cache.
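
For example, combining the above with the split_examples function from earlier (the num_proc value is arbitrary; pick whatever fits your machine):

from datasets import Image

# skip image decoding while splitting rows; the image bytes are just copied as-is
ds = ds.cast_column("image", Image(decode=False))

ds = ds.map(
    split_examples,
    batched=True,
    remove_columns=["text1", "text2"],
    num_proc=8,  # example value; set to your number of CPU cores
)

# turn decoding back on so "image" yields PIL images again
ds = ds.cast_column("image", Image())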

The only on-the-fly, one-to-many-examples option is the one mentioned above, right?

Correct!
