Datasets: 2.1.0
OS: Ubuntu 20.04.4
Python: 3.9.7
I have the following code:
import numpy as np
import pandas as pd
from datasets import Dataset
from scipy.stats import zscore

df = pd.DataFrame(columns=['m1', 'm2'])
for ii in range(5):
    matrix1 = np.random.rand(4)
    matrix2 = np.random.rand(6)
    df = df.append({'m1': matrix1, 'm2': matrix2}, ignore_index=True)
dataset = Dataset.from_pandas(df)

def preprocess(example):
    normalized_col = zscore(example['m1']).tolist()
    example['_'.join(['m1', 'normalized'])] = normalized_col

dataset = dataset.add_item({'_'.join(['m1', 'normalized']): None})
dataset = dataset.map(preprocess)
When I run this code, I get the following error:
**Exception has occurred: AxisError
axis 0 is out of bounds for array of dimension 0
File "/home/aclifton/rf_fp/test_run.py", line 35, in preprocess
normalized_col = zscore(example['m1']).tolist()
File "/home/aclifton/rf_fp/test_run.py", line 40, in <module>
dataset = dataset.map(preprocess)**
From what I can tell, calling dataset.add_item() changes num_rows in the Dataset object from 5 to 6. I’m not sure of the best way to correct this. Basically, I want to take a feature of the Dataset object, perform some calculation on it, and then add the transformed array as a new feature to the existing Dataset object without changing the number of rows. Any thoughts on how best to do this?
Thanks in advance for your help!!