How to get the maximum and minimum values of a feature?

I have a large dataset with a list-type feature of shape [num_steps, num_channels]. For each channel, I’d like to get the maximum and minimum value across all steps and all examples in the dataset.

I have already tried the following code using map() but it is not working:

import math
import multiprocessing

maxs = [-math.inf] * 6
mins = [math.inf] * 6

def getmaxmin(example, maxs, mins):
    for tok in example['tokens']:
        for i in range(len(maxs)):
            maxs[i] = max(maxs[i], tok[i])
            mins[i] = min(mins[i], tok[i])
    return example

datasets.map(
    getmaxmin,
    num_proc=multiprocessing.cpu_count(),
    fn_kwargs={
        'maxs': maxs,
        'mins': mins,
    },
)

Hi! We don’t currently have an official API for running aggregations directly on the underlying Arrow table, but you can use the experimental datasets_sql package, which leverages DuckDB.

So in your case, to get maximum and minimum values, you would run the following queries (I’m assuming that num_channels=6):

# ... dataset initialization
from datasets_sql import query
query_max = query("""
    SELECT MAX(tokens[0]) as max_0, MAX(tokens[1]) as max_1, MAX(tokens[2]) as max_2,
           MAX(tokens[3]) as max_3, MAX(tokens[4]) as max_4, MAX(tokens[5]) as max_5
    FROM (SELECT unnest(tokens) as tokens FROM dataset)
""")
query_min = query("""
    SELECT MIN(tokens[0]) as min_0, MIN(tokens[1]) as min_1, MIN(tokens[2]) as min_2,
           MIN(tokens[3]) as min_3, MIN(tokens[4]) as min_4, MIN(tokens[5]) as min_5
    FROM (SELECT unnest(tokens) as tokens FROM dataset)
""")
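
As for why the map() attempt probably leaves maxs and mins unchanged: with num_proc greater than 1, each worker process most likely updates its own copy of the lists passed through fn_kwargs, so the lists in the main process never see the results. If you would rather not add a dependency, a minimal single-process sketch in plain Python could look like the one below. The helper names channel_min_max and merge_min_max and the toy data are made up for illustration and are not part of datasets; the sketch assumes that iterating over the dataset yields dicts with a 'tokens' column shaped [num_steps, num_channels].

```python
import math

def channel_min_max(examples, num_channels, column="tokens"):
    """Return (mins, maxs) per channel over every step of every example."""
    mins = [math.inf] * num_channels
    maxs = [-math.inf] * num_channels
    for example in examples:
        for step in example[column]:  # one step holds one value per channel
            for i in range(num_channels):
                mins[i] = min(mins[i], step[i])
                maxs[i] = max(maxs[i], step[i])
    return mins, maxs

def merge_min_max(partials):
    """Combine (mins, maxs) pairs computed separately, e.g. one per shard."""
    all_mins, all_maxs = zip(*partials)
    mins = [min(vals) for vals in zip(*all_mins)]
    maxs = [max(vals) for vals in zip(*all_maxs)]
    return mins, maxs

# Toy data standing in for the real dataset: 2 examples, 2 channels.
toy = [
    {"tokens": [[1.0, -2.0], [3.0, 0.5]]},
    {"tokens": [[-4.0, 7.0]]},
]
mins, maxs = channel_min_max(toy, num_channels=2)
print(mins, maxs)  # [-4.0, -2.0] [3.0, 7.0]
```

If a single pass is too slow, one option is to compute channel_min_max on separate shards (each call returning its own result instead of mutating shared lists) and combine the partial results with merge_min_max.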