Add new feature without changing number of rows

Datasets: 2.1.0
OS: Ubuntu 20.04.4
Python: 3.9.7

I have the following code:

import numpy as np
import pandas as pd
from datasets import Dataset
from scipy.stats import zscore

df = pd.DataFrame(columns=['m1', 'm2'])
for ii in range(5):
    matrix1 = np.random.rand(4)
    matrix2 = np.random.rand(6)
    df = df.append({'m1': matrix1, 'm2': matrix2}, ignore_index=True)

dataset = Dataset.from_pandas(df)


def preprocess(example):
    normalized_col = zscore(example['m1']).tolist()
    example['_'.join(['m1', 'normalized'])] =  normalized_col

            
dataset = dataset.add_item({'_'.join(['m1', 'normalized']): None})
dataset = dataset.map(preprocess)

When I run this code, I get the following error:

Exception has occurred: AxisError
axis 0 is out of bounds for array of dimension 0
  File "/home/aclifton/rf_fp/test_run.py", line 35, in preprocess
    normalized_col = zscore(example['m1']).tolist()
  File "/home/aclifton/rf_fp/test_run.py", line 40, in <module>
    dataset = dataset.map(preprocess)

From what I can tell, calling dataset.add_item() changes num_rows in the Dataset object from 5 to 6. I’m not sure of the best way to correct this. Basically, I want to take a feature of the Dataset object, perform some calculation on it, and then add the transformed array as a new feature to the existing Dataset object without changing the number of rows. Any thoughts on how best to do this?

Thanks in advance for your help!!

Hi! add_item adds a new row. However, what you want is to add a new column, right?

To do so, you can use map to perform some calculation and create a new column:


def preprocess(example):
    normalized_col = zscore(example['m1']).tolist()
    return {'m1_normalized': normalized_col}

dataset = Dataset.from_pandas(df).map(preprocess)

map “updates” the dataset, so it’s going to keep the other columns and add the m1_normalized column.

@lhoestq Yep that’s exactly what I needed. Thank you!!