How to get maximum and minimum value of features?

Hi! We don’t currently have an official API for running aggregations directly on the underlying Arrow table, but you can use the experimental datasets_sql package, which runs SQL queries over datasets using DuckDB.

So in your case, to get maximum and minimum values, you would run the following queries (I’m assuming that num_channels=6):

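# ... dataset initialization (the SQL below refers to the Python variable named `dataset`)
from datasets_sql import query
# The inner SELECT unnests each example's `tokens` list into one row per inner list;
# the outer SELECT then aggregates each channel position across all of those rows.
# Note: current DuckDB releases index lists from 1, so tokens[0] is NULL there.
# If max_0/min_0 come back NULL, use tokens[1] ... tokens[6] instead.
query_max = query("SELECT MAX(tokens[0]) as max_0, MAX(tokens[1]) as max_1, MAX(tokens[2]) as max_2, MAX(tokens[3]) as max_3, MAX(tokens[4]) as max_4, MAX(tokens[5]) as max_5 FROM (SELECT unnest(tokens) as tokens FROM dataset)")
query_min = query("SELECT MIN(tokens[0]) as min_0, MIN(tokens[1]) as min_1, MIN(tokens[2]) as min_2, MIN(tokens[3]) as min_3, MIN(tokens[4]) as min_4, MIN(tokens[5]) as min_5 FROM (SELECT unnest(tokens) as tokens FROM dataset)")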
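
If you want to sanity-check those results on a small slice, here is a minimal plain-Python sketch of the same per-channel aggregation. It assumes each example's tokens is a list of inner lists with num_channels values each, which is what the queries above imply. The helper name channel_min_max and the sample data are made up for illustration and are not part of datasets or datasets_sql.

```python
from typing import Iterable, List, Sequence, Tuple


def channel_min_max(
    rows: Iterable[Sequence[Sequence[float]]], num_channels: int = 6
) -> Tuple[List[float], List[float]]:
    """Return (mins, maxs) per channel over all inner lists of all examples."""
    mins = [float("inf")] * num_channels
    maxs = [float("-inf")] * num_channels
    for tokens in rows:  # one example: a list of inner lists
        for frame in tokens:  # one inner list: num_channels values
            for c in range(num_channels):
                value = frame[c]
                if value < mins[c]:
                    mins[c] = value
                if value > maxs[c]:
                    maxs[c] = value
    return mins, maxs


# Hypothetical sample standing in for a few rows of the "tokens" column.
sample = [
    [[1, 2, 3, 4, 5, 6], [0, 9, 3, 4, 5, 7]],
    [[-1, 2, 8, 4, 5, 6]],
]
mins, maxs = channel_min_max(sample)
print(mins)
print(maxs)
```

On a real dataset you could pass something like dataset[:100]["tokens"] as rows. This loops in pure Python, so it is only meant for checking a small sample, not for scanning the full dataset.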