How to get maximum and minimum value of features?

amirveyseh · March 21, 2022, 4:38am

I have a large dataset with a list-typed feature of shape [num_steps, num_channels]. For each channel, I’d like to get the maximum and minimum value across all steps and all examples in the dataset.

I have already tried the following code using map() but it is not working:

import math
import multiprocessing

maxs = [-math.inf]*6
mins = [math.inf]*6

def getmaxmin(example,maxs,mins):
   for tok in example['tokens']:
       for i in range(len(maxs)):
           maxs[i] = max(maxs[i],tok[i])
           mins[i] = min(mins[i],tok[i])
   return example

datasets.map(getmaxmin,
   num_proc=multiprocessing.cpu_count(),
   fn_kwargs={
       'maxs':maxs,
       'mins':mins
})

mariosasko · March 31, 2022, 12:42pm

Hi! We don’t have an official API (currently) for running aggregations directly on the underlying Arrow table, but you can use the experimental datasets_sql package that leverages DuckDB.

So in your case, to get maximum and minimum values, you would run the following queries (I’m assuming that num_channels=6):

# ... dataset initialization
from datasets_sql import query
query_max = query("SELECT MAX(tokens[0]) as max_0, MAX(tokens[1]) as max_1, MAX(tokens[2]) as max_2, MAX(tokens[3]) as max_3, MAX(tokens[4]) as max_4, MAX(tokens[5]) as max_5 FROM (SELECT unnest(tokens) as tokens FROM dataset)")
query_min = query("SELECT MIN(tokens[0]) as min_0, MIN(tokens[1]) as min_1, MIN(tokens[2]) as min_2, MIN(tokens[3]) as min_3, MIN(tokens[4]) as min_4, MIN(tokens[5]) as min_5 FROM (SELECT unnest(tokens) as tokens FROM dataset)")
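
If installing an extra package is not an option, a plain single-process loop is another possibility. One likely reason the map() attempt leaves maxs and mins unchanged is that with num_proc greater than 1 each worker process updates its own copy of the lists, so the parent process never sees the changes. The sketch below avoids that by keeping the running values in one process. The helper name channel_min_max and its column parameter are hypothetical, and the toy list only stands in for the real dataset; it assumes that iterating over the dataset yields one dict per example.

```python
import math


def channel_min_max(examples, num_channels, column="tokens"):
    """Return (mins, maxs) per channel over all steps of all examples.

    `examples` is any iterable of dicts whose `column` entry is a list of
    steps, each step being a list of `num_channels` numbers.
    """
    mins = [math.inf] * num_channels
    maxs = [-math.inf] * num_channels
    for example in examples:
        for step in example[column]:
            for i in range(num_channels):
                value = step[i]
                if value < mins[i]:
                    mins[i] = value
                if value > maxs[i]:
                    maxs[i] = value
    return mins, maxs


# Toy stand-in for the dataset: two examples, two channels.
toy = [
    {"tokens": [[1.0, -2.0], [3.0, 0.5]]},
    {"tokens": [[-4.0, 7.0]]},
]
mins, maxs = channel_min_max(toy, num_channels=2)
print(mins, maxs)
```

For the dataset in the question this would be called as channel_min_max(dataset, num_channels=6). It reads every example once in pure Python, so it will be slower than an aggregation that runs directly on the Arrow table.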
