I have a large dataset with a list-typed feature of shape [num_steps, num_channels]. For each channel, I'd like to get the maximum and minimum values across all steps and all examples in the dataset.
I have already tried the following code using map(), but it is not working:
maxs = [-math.inf]*6
mins = [math.inf]*6

def getmaxmin(example,maxs,mins):
    for tok in example['tokens']:
        for i in range(len(maxs)):
            maxs[i] = max(maxs[i],tok[i])
            mins[i] = min(mins[i],tok[i])
    return example

datasets.map(getmaxmin,
    num_proc=multiprocessing.cpu_count(),
    fn_kwargs={
        'maxs':maxs,
        'mins':mins
    })
Hi! We don't currently have an official API for running aggregations directly on the underlying Arrow table, but you can use the experimental datasets_sql package, which leverages DuckDB.
So in your case, to get the maximum and minimum values, you would run the following queries (I'm assuming that num_channels=6):
# ... dataset initialization
from datasets_sql import query
query_max = query("SELECT MAX(tokens[0]) as max_0, MAX(tokens[1]) as max_1, MAX(tokens[2]) as max_2, MAX(tokens[3]) as max_3, MAX(tokens[4]) as max_4, MAX(tokens[5]) as max_5 FROM (SELECT unnest(tokens) as tokens FROM dataset)")
query_min = query("SELECT MIN(tokens[0]) as min_0, MIN(tokens[1]) as min_1, MIN(tokens[2]) as min_2, MIN(tokens[3]) as min_3, MIN(tokens[4]) as min_4, MIN(tokens[5]) as min_5 FROM (SELECT unnest(tokens) as tokens FROM dataset)")
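As for why the map() attempt likely fails: with num_proc set, each worker process receives its own copy of maxs and mins, so the updates never reach the lists in the parent process. If you would rather stay in plain Python, here is a minimal single-process sketch of the same aggregation. The helper name channel_min_max and the toy data are made up for illustration; it only assumes that iterating over the dataset yields dicts with a 'tokens' column shaped [num_steps, num_channels].

```python
import math


def channel_min_max(examples, num_channels=6, column="tokens"):
    """Return (mins, maxs) per channel over all steps of all examples.

    `examples` is any iterable of dicts; `column` holds a list of steps,
    each step being a list with `num_channels` numbers.
    """
    mins = [math.inf] * num_channels
    maxs = [-math.inf] * num_channels
    for example in examples:
        for step in example[column]:
            for i in range(num_channels):
                value = step[i]
                if value < mins[i]:
                    mins[i] = value
                if value > maxs[i]:
                    maxs[i] = value
    return mins, maxs


# Toy stand-in for the real dataset: 2 examples, 2 channels.
toy = [
    {"tokens": [[1.0, -2.0], [3.0, 0.5]]},
    {"tokens": [[-4.0, 7.0]]},
]
mins, maxs = channel_min_max(toy, num_channels=2)
print(mins, maxs)
```

Because the loop runs in a single process, the accumulators are ordinary local lists and nothing has to be shared between workers. It will be slower than the SQL approach on a large dataset, so treat it as a fallback or a way to sanity-check the query results on a small slice.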