IterableDataset compute feature mean and create histogram

Hi!

I have an IterableDataset created using streaming, and I want to compute the mean of a feature named "num_tokens". It's a huge dataset that doesn't fit in memory, so converting it to a Pandas DataFrame is apparently not an option...

I've been reading that this could be accomplished using .map(), but I haven't been able to do it.

I also want to graph this column in a histogram using something like this:

sns.displot(data['num_tokens'], bins=100, kde=True)

Is this even possible?

Thank you very much in advance!

Hi! What is the data format of your dataset? Maybe it's possible to load it using Dask (same as pandas, but it can do analytics on datasets bigger than RAM).

Thanks for your interest! My iterable dataset has four main columns:

  • One with a unique identifier for every episode
  • One containing clinical texts
  • One containing the list of the tokens of every text

I need to achieve two things:

  1. Compute the average number of tokens across all texts and, if possible, create a visualization of the distribution of text lengths.
  2. Create a vocabulary (set of unique tokens) with the tokens of all texts.
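
For the first goal, here is a minimal sketch of one way to do it in a single pass, without Pandas. It assumes each example is a dict with an integer "num_tokens" field (which is how a streamed dataset yields rows when you iterate over it); the function name and the bin_width parameter are made up for illustration. It keeps only a running sum, a count, and one counter per histogram bin, so memory use does not grow with the number of rows.

```python
from collections import Counter

def streaming_length_stats(examples, column="num_tokens", bin_width=50):
    """Return (mean, histogram) for an integer column in one pass.

    `examples` is any iterable of dicts, e.g. a streamed dataset.
    The histogram maps the lower edge of each bin to its row count.
    """
    total = 0
    count = 0
    bins = Counter()
    for example in examples:
        n = example[column]
        total += n
        count += 1
        # Bucket the value into a fixed-width bin: [0, 50), [50, 100), ...
        bins[(n // bin_width) * bin_width] += 1
    mean = total / count if count else float("nan")
    return mean, dict(sorted(bins.items()))

# Tiny stand-in for the real streamed dataset (hypothetical values).
rows = [{"num_tokens": n} for n in (10, 60, 70, 260)]
mean, hist = streaming_length_stats(rows, bin_width=50)
print(mean)  # 100.0
print(hist)  # {0: 1, 50: 2, 250: 1}
```

The resulting counts are small enough to plot directly, for example with matplotlib's plt.bar(list(hist.keys()), list(hist.values()), width=50, align="edge"). A KDE curve like the one from sns.displot is not available from binned counts alone; that would probably need a random sample of the column instead.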

My problem is that the dataset is so big that it does not fit in memory, so I cannot do these computations simply using Pandas.

Following your comment, I have been reading about the possibilities offered by Dask, and it seems quite promising, so I'm going to try it with that library. Anyway, if anyone knows how to achieve what I am proposing starting from the IterableDataset, it would be extremely helpful for me and maybe for others.
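
For the vocabulary, the same single-pass idea seems to apply. This is only a sketch: the column name "tokens" is an assumption (use whatever the token-list column is actually called), and it relies on the set of unique tokens fitting in memory, which is usually the case even when the texts themselves do not.

```python
from collections import Counter

def build_vocabulary(examples, column="tokens"):
    """Count every token seen while iterating once over `examples`.

    `examples` is any iterable of dicts whose `column` holds a list of tokens.
    """
    counts = Counter()
    for example in examples:
        counts.update(example[column])
    return counts

# Tiny stand-in for the real streamed dataset (hypothetical values).
rows = [
    {"tokens": ["fever", "and", "cough"]},
    {"tokens": ["no", "fever"]},
]
vocab_counts = build_vocabulary(rows)
vocabulary = set(vocab_counts)   # unique tokens only
print(len(vocabulary))           # 4
print(vocab_counts["fever"])     # 2
```

Keeping a Counter rather than a plain set costs little extra and also gives token frequencies, which can help if rare tokens need to be filtered out later.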

Thank you very much for your suggestion!