But the problem with this solution is that map returns a one-shot iterator rather than a re-iterable object, i.e., it cannot be iterated a second time:
for v in column_values_only_ds:
    print(v)  # Prints "Good" and "Bad"
for v in column_values_only_ds:
    print(v)  # Prints nothing
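The same behaviour can be reproduced without Datasets, which suggests it comes from Python's built-in map rather than from the dataset itself. The snippet below is only a minimal sketch: the list `rows` is a stand-in for the real dataset, and the variable names are made up for illustration.

```python
# Stand-in data; in the real case this would be the IterableDataset.
rows = [{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}]

texts = map(lambda row: row["text"], rows)

first_pass = list(texts)   # consumes the map object
second_pass = list(texts)  # nothing left to consume

# A map object is its own iterator, so iter() cannot restart it.
is_own_iterator = iter(texts) is texts

print(first_pass, second_pass, is_own_iterator)
```

Because `iter(texts)` returns the same exhausted object, a `for` loop has no way to start over.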
So, how can I create an iterable that returns only column values?
P.S. I’m building a single interface for running experiments with different models and, e.g., FastText requires only lists of strings, not dictionaries.
If you want to iterate over just the "text" column in your IterableDataset and make sure it can be re-iterated (unlike map), you can use a generator function. This way, you’ll always get a fresh iterable whenever you need it.
Here’s how you can do it:
from datasets import IterableDataset

# Your original dataset generator
def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)

# A function to pull only the "text" values
def extract_text_column(dataset):
    for item in dataset:
        yield item["text"]

# A callable that gives you a fresh iterator each time
column_values_only_ds = lambda: extract_text_column(ds)

# Now, let's iterate over the "text" column
for v in column_values_only_ds():
    print(v)  # Prints "Good" and "Bad"

# You can do it again without issues!
for v in column_values_only_ds():
    print(v)  # Prints "Good" and "Bad" again
Generator Function: extract_text_column(dataset) is like a recipe to grab just the "text" values one at a time.
Fresh Start: Each time you call column_values_only_ds(), it gives you a brand-new iterator. So, no matter how many times you loop, it works!
Simple and Reusable: This makes it super handy if you’re building experiments or pipelines where re-iteration matters.
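To see why the extra callable matters, here is a small sketch that compares reusing one generator object with calling a factory each time. It uses a plain list named `rows` as a stand-in for the dataset, and `make_iter` is just an illustrative name for the lambda.

```python
def extract_text_column(dataset):
    for item in dataset:
        yield item["text"]

# Stand-in data; assumed to behave like the dataset for iteration purposes.
rows = [{"text": "Good", "label": 0}, {"text": "Bad", "label": 1}]

# Reusing a single generator object: the second pass is empty.
gen_obj = extract_text_column(rows)
first_pass = list(gen_obj)
second_pass = list(gen_obj)

# Calling a factory: every call builds a new generator object.
make_iter = lambda: extract_text_column(rows)
fresh_first = list(make_iter())
fresh_second = list(make_iter())

print(first_pass, second_pass, fresh_first, fresh_second)
```

The generator function alone is not enough; the re-iteration comes from calling it again for each loop.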
I hope this clears things up and helps you with your project. Feel free to reach out if you have more questions. Happy coding!
While this works, it loses the functionality of the IterableDataset (its methods and attributes are no longer accessible), so I was hoping for a built-in Datasets solution. Your answer suggests that there is no such functionality. OK.
By the way, something like this should also work:
from typing import Iterator

class IterableDatasetColumnGetter:
    def __init__(self, dataset: IterableDataset, column_name: str) -> None:
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator:
        # map already returns an iterator, and a new one is built on every call
        return map(lambda x: x[self.column_name], self.dataset)

iterable_column_values_only_ds = IterableDatasetColumnGetter(ds, "text")
for v in iterable_column_values_only_ds:
    print(v)  # Prints "Good" and "Bad"
for v in iterable_column_values_only_ds:
    print(v)  # Prints "Good" and "Bad" again
but again it looks like it is not a good solution due to the loss of the original functionality.
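One possible way to soften that loss is to forward unknown attribute lookups to the wrapped dataset via `__getattr__`. This is only a sketch under assumptions: `ColumnView` and `FakeDataset` are hypothetical names, `FakeDataset` merely stands in for an IterableDataset so the example is self-contained, and this is not a feature of the Datasets library.

```python
from typing import Any, Iterable, Iterator


class ColumnView:
    """Hypothetical helper: a re-iterable view of one column that forwards
    any other attribute lookup to the wrapped dataset."""

    def __init__(self, dataset: Iterable[dict], column_name: str) -> None:
        self.dataset = dataset
        self.column_name = column_name

    def __iter__(self) -> Iterator[Any]:
        # A new generator is created on every call, so loops can be repeated.
        return (row[self.column_name] for row in self.dataset)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal lookup fails. The guard avoids infinite
        # recursion if "dataset" itself has not been set yet.
        if name == "dataset":
            raise AttributeError(name)
        return getattr(self.dataset, name)


class FakeDataset:
    """Stand-in for the real dataset (an assumption for this sketch)."""

    name = "toy"

    def __iter__(self) -> Iterator[dict]:
        yield {"text": "Good", "label": 0}
        yield {"text": "Bad", "label": 1}


view = ColumnView(FakeDataset(), "text")
first_pass = list(view)
second_pass = list(view)
forwarded_name = view.name  # resolved on the wrapped object

print(first_pass, second_pass, forwarded_name)
```

A caveat of this design: forwarded methods that return a new dataset would return the unwrapped object, which yields dictionaries again, so the result would have to be wrapped once more.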