How to iterate over values of a column in the IterableDataset?

Innovator2K · January 14, 2025, 11:33am

Suppose we have a simple iterable dataset from the documentation:

from datasets import IterableDataset

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

and suppose I want to iterate over the "text" column values. An obvious solution can be the following:

column_values_only_ds = map(lambda x: x["text"], ds)

But the problem with this solution is that map returns a one-shot iterator, i.e., it cannot be re-iterated:

for v in column_values_only_ds:
    print(v)  # Prints "Good" and "Bad"
for v in column_values_only_ds:
    print(v)  # Prints nothing
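
The behavior above follows from how `map` works: it returns an object that is its own iterator, so every loop shares the same, already advanced state. A minimal sketch of this, assuming a plain list of dicts (`rows`, a name used only for illustration) as a stand-in for the dataset:

```python
# Stand-in for the dataset: a plain list of dicts (an assumption for illustration).
rows = [{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}]

mapped = map(lambda x: x["text"], rows)

# A map object is its own iterator, so iter() hands back the same object.
same_object = iter(mapped) is mapped

first_pass = list(mapped)   # consumes the iterator
second_pass = list(mapped)  # nothing is left to yield

print(same_object, first_pass, second_pass)
```

Anything that should be re-iterable therefore needs an `__iter__` that builds a new iterator on every call.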

So, how can I create an iterable that returns only column values?

P.S. I’m building a single interface for running experiments with different models and, e.g., FastText requires only lists of strings, not dictionaries.

Alanturner2 · January 14, 2025, 1:10pm

Hi there!

If you want to iterate over just the "text" column in your IterableDataset and make sure it can be re-iterated (unlike map), you can use a generator function. This way, you’ll always get a fresh iterable whenever you need it.

Here’s how you can do it:

from datasets import IterableDataset

# Your original dataset generator
def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

# A function to pull only the "text" values
def extract_text_column(dataset):
    for item in dataset:
        yield item["text"]

# A callable that gives you a fresh iterator each time
column_values_only_ds = lambda: extract_text_column(ds)

# Now, let's iterate over the "text" column
for v in column_values_only_ds():
    print(v)  # Prints "Good" and "Bad"

# You can do it again without issues!
for v in column_values_only_ds():
    print(v)  # Prints "Good" and "Bad" again

Generator Function: extract_text_column(dataset) is like a recipe to grab just the "text" values one at a time.
Fresh Start: Each time you call column_values_only_ds(), it gives you a brand-new iterator. So, no matter how many times you loop, it works!
Simple and Reusable: This makes it super handy if you’re building experiments or pipelines where re-iteration matters.
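
One caveat may be worth spelling out: it is the call that creates the fresh iterator, while a single generator object is still one-shot. A small sketch of the difference, assuming a plain list (`rows`) in place of the dataset and a named function `column_values` instead of the lambda (both are illustrative choices, not part of the answer above):

```python
# Stand-in for the dataset: any re-iterable object works here (an assumption).
rows = [{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}]


def extract_text_column(dataset):
    for item in dataset:
        yield item["text"]


def column_values():
    # Each call builds a new generator that starts from the beginning.
    return extract_text_column(rows)


fresh_a = list(column_values())
fresh_b = list(column_values())

# Caveat: keeping one generator object around does not make it re-iterable.
gen_obj = column_values()
once = list(gen_obj)
again = list(gen_obj)

print(fresh_a, fresh_b, once, again)
```

So code that receives the values needs either the callable itself or an object whose `__iter__` does the call for it.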

I hope this clears things up and helps you with your project. Feel free to reach out if you have more questions. Happy coding!

Innovator2K · January 14, 2025, 2:07pm

Thank you for the answer!

While this works, it loses the functionality of the IterableDataset (its methods and attributes are no longer accessible), so I hoped for a built-in Datasets solution, but your answer suggests that there is no such functionality. OK.

By the way, something like this should also work:

from typing import Iterator

from datasets import IterableDataset

class IterableDatasetColumnGetter:
    def __init__(self, dataset: IterableDataset, column_name: str) -> None:
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator:
        return iter(map(lambda x: x[self.column_name], self.dataset))

iterable_column_values_only_ds = IterableDatasetColumnGetter(ds, "text")

for v in iterable_column_values_only_ds:
    print(v)  # Prints "Good" and "Bad"

for v in iterable_column_values_only_ds:
    print(v)  # Prints "Good" and "Bad" again

but again it looks like it is not a good solution due to the loss of the original functionality.
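
One possible way to soften that loss is to forward unknown attribute lookups to the wrapped dataset. The sketch below is only an illustration under stated assumptions: `ColumnView` and `ToyDataset` are hypothetical names, and `ToyDataset` is a stand-in so the example runs without the datasets library.

```python
from typing import Any, Iterable, Iterator


class ColumnView:
    """Re-iterable view over one column that forwards other attributes."""

    def __init__(self, dataset: Iterable[dict], column_name: str) -> None:
        self._dataset = dataset
        self._column_name = column_name

    def __iter__(self) -> Iterator[Any]:
        # A new generator per call, so the view can be looped over repeatedly.
        return (row[self._column_name] for row in self._dataset)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal lookup fails. Private names are not
        # forwarded, which avoids recursion if _dataset is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._dataset, name)


class ToyDataset:
    """Hypothetical stand-in for an IterableDataset, used only in this sketch."""

    def __init__(self, rows: list) -> None:
        self.rows = rows
        self.column_names = list(rows[0])

    def __iter__(self) -> Iterator[dict]:
        return iter(self.rows)


toy = ToyDataset([{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}])
texts = ColumnView(toy, "text")

first = list(texts)
second = list(texts)
names = texts.column_names  # forwarded to the wrapped dataset
print(first, second, names)
```

Forwarded methods still act on the full rows and return whatever the wrapped object returns, not another `ColumnView`, so this is a convenience rather than a real column type.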

system · January 15, 2025, 2:07am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.

lhoestq · January 27, 2025, 10:42am

Hi! Could it be interesting to implement an IterableColumn? What do you think of something like this?

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)
texts = ds["text"]  # `texts` is an IterableColumn object

for v in texts:
    print(v)

If you like this API, feel free to suggest it in an issue on GitHub or open a PR.
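
To make the proposed API concrete, here is a toy sketch of how such indexing could behave. It is not the library's implementation: `ToyIterableDataset` is a hypothetical stand-in, and the `IterableColumn` class below only mimics the name used in the proposal.

```python
from typing import Any, Callable, Iterator


class IterableColumn:
    """Hypothetical re-iterable column, written only to illustrate the idea."""

    def __init__(self, source: Callable[[], Iterator[dict]], name: str) -> None:
        self._source = source
        self._name = name

    def __iter__(self) -> Iterator[Any]:
        # Calling the source again restarts the underlying generator.
        for row in self._source():
            yield row[self._name]


class ToyIterableDataset:
    """Stand-in for IterableDataset so the sketch runs on its own."""

    def __init__(self, generator: Callable[[], Iterator[dict]]) -> None:
        self._generator = generator

    def __iter__(self) -> Iterator[dict]:
        return self._generator()

    def __getitem__(self, column_name: str) -> IterableColumn:
        return IterableColumn(self._generator, column_name)


def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}


ds = ToyIterableDataset(gen)
texts = ds["text"]

first = list(texts)
second = list(texts)
print(first, second)
```

The point of the design is that the column keeps a reference to the generator function rather than to a running generator, which is what makes repeated loops possible.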
