IterableDataset compute feature mean and create histogram

Elle518 · May 11, 2023, 7:18am

Hi!

I have an IterableDataset created using streaming, and I want to compute the mean of a feature named “num_tokens.” It’s a huge dataset that doesn’t fit in memory, so converting it to a Pandas DataFrame is apparently not an option…

I’ve been reading that this could be accomplished using .map(), but I haven’t been able to do it.

I also want to graph this column in a histogram using something like this:

sns.displot(data['num_tokens'], bins=100, kde=True)

Is this even possible?

Thank you very much in advance!
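
For the mean and histogram question above, here is a minimal sketch of a single-pass approach that keeps only running totals in memory. It assumes each streamed example is a dict-like row with an integer "num_tokens" field; the function name, the toy rows, and the bin width of 50 are made up for illustration and are not part of the datasets library.

```python
from collections import Counter


def streaming_mean_and_histogram(rows, column="num_tokens", bin_width=50):
    """One pass over an iterable of dict-like rows, keeping only running totals.

    Hypothetical helper: `column` and `bin_width` are illustrative defaults.
    """
    total = 0
    count = 0
    bins = Counter()  # bin start -> number of rows in [start, start + bin_width)
    for row in rows:
        value = row[column]
        total += value
        count += 1
        bins[(value // bin_width) * bin_width] += 1
    mean = total / count if count else float("nan")
    return mean, dict(sorted(bins.items()))


# Toy stand-in for the streamed dataset (assumption: rows look like this).
toy_rows = [{"num_tokens": n} for n in (10, 60, 70, 120, 40)]
mean, histogram = streaming_mean_and_histogram(toy_rows, bin_width=50)
print(mean)
print(histogram)
```

Since iterating over an IterableDataset yields one dict per example, the dataset itself could presumably be passed as rows. The binned counts can then be drawn with something like matplotlib's plt.bar(list(histogram), list(histogram.values()), width=50, align="edge"). Note that this gives the bars only: a KDE curve like the one from kde=True cannot be recovered from binned counts alone.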

lhoestq · May 12, 2023, 9:35am

Hi! What is the data format of your dataset? Maybe it’s possible to load it using Dask (similar to pandas, but it can run analytics on datasets bigger than RAM).

Elle518 · May 15, 2023, 2:55pm

Thanks for your interest! My iterable dataset has four main columns:

One with a unique identifier for every episode
One containing clinical texts
One containing the list of the tokens of every text

I need to achieve two things:

Compute the average number of tokens across all texts and, if possible, create a visualization of the distribution of text lengths.
Create a vocabulary (set of unique tokens) with the tokens of all texts.

My problem is that the dataset is so big that it does not fit in memory, so I cannot do these computations simply using Pandas.

Following your comment, I have been reading about the possibilities offered by Dask, and it seems quite promising, so I’m going to try it with that library. Anyway, if anyone knows how to achieve what I am proposing starting from the IterableDataset, it would be extremely helpful for me and maybe for others.

Thank you very much for your suggestion!
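
For the vocabulary goal, a minimal sketch along the same streaming lines: update a Counter one row at a time, so only the set of unique tokens (not the texts) has to fit in memory. The column name "tokens" and the helper name are assumptions, since the thread does not give the actual column names.

```python
from collections import Counter


def build_vocabulary(rows, column="tokens"):
    """Accumulate token counts over an iterable of dict-like rows.

    Hypothetical helper: `column` is an assumed name for the token-list field.
    """
    vocab = Counter()
    for row in rows:
        vocab.update(row[column])  # row[column] is expected to be a list of tokens
    return vocab


# Toy stand-in for the streamed dataset (assumption: rows look like this).
toy_rows = [
    {"tokens": ["patient", "has", "fever"]},
    {"tokens": ["patient", "has", "cough"]},
]
vocab = build_vocabulary(toy_rows)
print(len(vocab))     # number of unique tokens
print(sorted(vocab))  # the vocabulary itself
```

A Counter is used instead of a plain set so that token frequencies come for free; set(vocab) gives the bare vocabulary. The average number of tokens could also be tracked in the same loop by summing len(row[column]).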
