How can I export the statistical information of an online Hugging Face dataset without downloading the whole dataset?

I’m trying to analyze the length distribution of a Hugging Face dataset. Hugging Face already shows its sequence length distribution.


How can I export a more detailed distribution without downloading the whole dataset?

Hi. You can use the API to get the statistics; see "Explore statistics over split data" in the docs.
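
For anyone landing here later, a minimal sketch of calling that endpoint from Python. It assumes the dataset viewer API root is `https://datasets-server.huggingface.co` and that the response follows the layout described in the docs (a `statistics` list whose entries carry `column_name` and `column_statistics`, with a `histogram` made of `hist` and `bin_edges`). The helper names and the dataset, config, split and column in the usage part are placeholders, not something from this thread.

```python
import json
import urllib.parse
import urllib.request

# Assumed root of the dataset viewer API.
API_ROOT = "https://datasets-server.huggingface.co"


def statistics_url(dataset, config, split):
    """Build the /statistics URL for one split of a dataset."""
    query = urllib.parse.urlencode(
        {"dataset": dataset, "config": config, "split": split}
    )
    return f"{API_ROOT}/statistics?{query}"


def fetch_statistics(dataset, config, split):
    """Download the precomputed statistics as a dict (needs network access)."""
    with urllib.request.urlopen(statistics_url(dataset, config, split)) as response:
        return json.load(response)


def histogram_rows(payload, column_name):
    """Return (bin_start, bin_end, count) tuples for one column.

    Assumes `bin_edges` has one more entry than `hist`.
    Returns an empty list if the column has no histogram.
    """
    for column in payload["statistics"]:
        if column["column_name"] == column_name:
            histogram = column["column_statistics"].get("histogram")
            if not histogram:
                return []
            edges = histogram["bin_edges"]
            return list(zip(edges, edges[1:], histogram["hist"]))
    raise KeyError(f"no statistics for column {column_name!r}")


if __name__ == "__main__":
    # Placeholder dataset, config, split and column: replace with your own.
    payload = fetch_statistics("nyu-mll/glue", "cola", "train")
    for start, end, count in histogram_rows(payload, "sentence"):
        print(f"{start} to {end}: {count}")
```

The histogram comes back already binned, so this only exports what the dataset page shows; it cannot make the bins finer.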

Many thanks for your reply!
I have tried to get statistical information with the recommended method. I am wondering how to get a more detailed length distribution instead of just 10 bins.

Oh OK. We don’t provide more than these descriptive statistics, so you might have to download the whole dataset. If the dataset is in Parquet (or if we have converted the full dataset to Parquet), you might be able to get some stats from the Parquet metadata (DuckDB is able to use it without downloading the data).
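
A rough sketch of the DuckDB route, with assumptions spelled out. Parquet metadata on its own only holds things like row counts and per-row-group min/max values, so a histogram of text lengths still needs the text column to be scanned; DuckDB reads just that column over HTTP rather than the whole files. The sketch assumes a recent DuckDB with `hf://` path support, and that the auto-converted Parquet files are reachable under the `@~parquet` revision. The function names, dataset path and column name are placeholders. Note that `length()` counts characters, not tokens.

```python
def length_histogram_sql(parquet_glob, column, bin_width):
    """Build a DuckDB query that bins character lengths of `column`."""
    if not isinstance(bin_width, int) or bin_width <= 0:
        raise ValueError("bin_width must be a positive integer")
    # Quote the identifier and the path so odd characters do not break the SQL.
    col = '"' + column.replace('"', '""') + '"'
    path = parquet_glob.replace("'", "''")
    return (
        f"SELECT length({col}) - length({col}) % {bin_width} AS bin_start, "
        f"count(*) AS n "
        f"FROM read_parquet('{path}') "
        f"GROUP BY bin_start ORDER BY bin_start"
    )


def length_histogram(parquet_glob, column, bin_width=50):
    """Run the query and return (bin_start, count) rows (needs network access)."""
    import duckdb  # third-party: pip install duckdb

    return duckdb.sql(length_histogram_sql(parquet_glob, column, bin_width)).fetchall()


if __name__ == "__main__":
    # Placeholder path and column: replace with your dataset, config and split.
    rows = length_histogram(
        "hf://datasets/nyu-mll/glue@~parquet/cola/train/*.parquet",
        "sentence",
        bin_width=10,
    )
    for bin_start, n in rows:
        print(bin_start, n)
```

Setting `bin_width=1` gives the exact length counts, which is as detailed as the distribution can get.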
