Data exploration/visualisation

Hello everyone,

I am new to the datasets library.
I have a quick question: does the datasets library provide out-of-the-box alternatives to common pandas functions such as value_counts, groupby, and mean, essentially anything that requires operating over columns?

A quick search via Google/ChatGPT doesn’t reveal a straightforward solution. I also couldn’t find one in the Hugging Face documentation: map, select, and filter all operate row-wise.

If there is no native way to do this in datasets, is it because these functions have yet to be incorporated, or because they can’t be supported due to fundamental limitations in how datasets is built, i.e. trading versatility for speed/performance?

Warm regards,
Varshit Dusad

PS: I am aware of converting back and forth with pandas. I just wanted to know if there is a non-pandas way to go about it (especially when dealing with a dataset that is too large to fit in memory).

Hi! datasets’ data processing capabilities are focused on model training. For Pandas-like data exploration, you can pass a Dataset’s underlying Arrow table (the dataset.data.table attribute) to libraries specialized for this, such as DuckDB or Polars.
