How to iterate over values of a column in the IterableDataset?

Suppose we have a simple iterable dataset from the documentation:

from datasets import IterableDataset

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

and suppose I want to iterate over the "text" column values. An obvious solution can be the following:

column_values_only_ds = map(lambda x: x["text"], ds)

But the problem with this solution is that map returns a one-shot iterator, i.e., it is exhausted after the first pass and cannot be re-iterated:

for v in column_values_only_ds:
    print(v)  # Prints "Good" and "Bad"
for v in column_values_only_ds:
    print(v)  # Prints nothing
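
The same behaviour can be reproduced without the datasets library. This is a minimal sketch in which `rows` is a plain list standing in for the dataset (an assumption made only to keep the example self-contained):

```python
# Stand-in data: a plain list of dicts instead of an IterableDataset
# (assumption for illustration only).
rows = [{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}]

texts = map(lambda row: row["text"], rows)

# A map object is its own iterator, so every loop over it shares one state.
same_object = iter(texts) is texts

first_pass = list(texts)   # consumes the iterator
second_pass = list(texts)  # nothing left to consume

print(same_object)   # True
print(first_pass)    # ['Good', 'Bad']
print(second_pass)   # []
```

Because `iter(texts)` returns `texts` itself, a second loop resumes from the exhausted state instead of starting over.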

So, how can I create an iterable that returns only column values?

P.S. I’m building a single interface for running experiments with different models and, e.g., FastText requires only lists of strings, not dictionaries.

Hi there! :blush:

If you want to iterate over just the "text" column in your IterableDataset and make sure it can be re-iterated (unlike map), you can use a generator function. This way, you’ll always get a fresh iterable whenever you need it.

Here’s how you can do it:

from datasets import IterableDataset

# Your original dataset generator
def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

# A function to pull only the "text" values
def extract_text_column(dataset):
    for item in dataset:
        yield item["text"]

# A callable that gives you a fresh iterator each time
column_values_only_ds = lambda: extract_text_column(ds)

# Now, let's iterate over the "text" column
for v in column_values_only_ds():
    print(v)  # Prints "Good" and "Bad"

# You can do it again without issues!
for v in column_values_only_ds():
    print(v)  # Prints "Good" and "Bad" again
  • Generator Function: extract_text_column(dataset) is like a recipe to grab just the "text" values one at a time.
  • Fresh Start: Each time you call column_values_only_ds(), it gives you a brand-new iterator. So, no matter how many times you loop, it works!
  • Simple and Reusable: This makes it super handy if you’re building experiments or pipelines where re-iteration matters.

I hope this clears things up and helps you with your project. Feel free to reach out if you have more questions. Happy coding! :rocket:

Thank you for the answer!

While this works, it loses the functionality of the IterableDataset (its methods and attributes are no longer accessible), so I had hoped for a built-in :hugs: Datasets solution, but your answer suggests that no such functionality exists. OK.

By the way, something like this should also work:

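from typing import Iterator

from datasets import IterableDataset

class IterableDatasetColumnGetter:
    def __init__(self, dataset: IterableDataset, column_name: str) -> None:
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator:
        # map() builds a new iterator on every call, so each loop starts over
        return map(lambda x: x[self.column_name], self.dataset)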

iterable_column_values_only_ds = IterableDatasetColumnGetter(ds, "text")

for v in iterable_column_values_only_ds:
    print(v)  # Prints "Good" and "Bad"

for v in iterable_column_values_only_ds:
    print(v) # Prints "Good" and "Bad" again

but again it looks like it is not a good solution due to the loss of the original functionality.
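
One possible way to soften that loss is to forward unknown attribute lookups to the wrapped dataset. The sketch below is only an illustration: `ColumnView` and `FakeDataset` are hypothetical names, not part of the datasets library, and `FakeDataset` stands in for an IterableDataset so the example runs on its own.

```python
from typing import Any, Iterable, Iterator


class ColumnView:
    """Hypothetical wrapper: iterates over one column, forwards everything else."""

    def __init__(self, dataset: Iterable[dict], column_name: str) -> None:
        self._dataset = dataset
        self._column_name = column_name

    def __iter__(self) -> Iterator[Any]:
        # A new generator is created on every call, so the view is
        # re-iterable as long as the wrapped dataset is.
        return (row[self._column_name] for row in self._dataset)

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails. Private names are not
        # forwarded, which avoids recursion if _dataset is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._dataset, name)


class FakeDataset:
    """Stand-in for an IterableDataset, used only to keep the sketch runnable."""

    description = "toy dataset"

    def __iter__(self) -> Iterator[dict]:
        yield {"text": "Good", "label": 0}
        yield {"text": "Bad", "label": 1}


texts = ColumnView(FakeDataset(), "text")

first_pass = list(texts)
second_pass = list(texts)
forwarded = texts.description  # resolved on the wrapped dataset

print(first_pass)   # ['Good', 'Bad']
print(second_pass)  # ['Good', 'Bad']
print(forwarded)    # toy dataset
```

This keeps attributes reachable, but any forwarded method that returns a new dataset would return the underlying type rather than a `ColumnView`, so the column selection would have to be reapplied.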

Hi! Would it be interesting to implement an IterableColumn? What do you think of something like this?

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)
texts = ds["text"]  # `texts` is an IterableColumn object

for v in texts:
    print(v)

If you like this API, feel free to suggest it in an issue on GitHub or open a PR :slight_smile:
