Dataset.map saves list as numpy array instead of as list

Hi! I have been trying to map a function that outputs a list over my dataset. However, when I export the result to a CSV or a DataFrame, the value appears as a NumPy array instead of a list. So, if I have the following function:

    def fetch_embedding(data):
        text = data["text"]
        out = trainer.predict(text)
        embeddings = out[0][1][-1][:,0,:]
        embeddings = embeddings.tolist()
        return {"embeddings" : embeddings}

So, I’m very intentionally converting the torch tensor to a list. Then I map the function over the dataset to save these embeddings:

    dataset = dataset.map(fetch_embedding)

We check whether the dataset stored the list as an actual list:

    df = dataset.to_pandas()
    A = df.iloc[0].loc["embeddings"]
    print(type(A))

The output of this is the following:

    <class 'numpy.ndarray'>

Is there any way to actually have the output of the map saved as a list instead of it being passed as a Numpy array?

Hi! This behavior comes from Arrow:

    >>> import pyarrow as pa
    >>> df = pa.table({"embeddings": [[0, 1, 2]]}).to_pandas()
    >>> df.iloc[0].loc["embeddings"]
    array([0, 1, 2])

Arrow uses NumPy arrays to enable fast zero-copy conversions from Arrow to pandas. Converting to a Python list requires copying the data into new Python objects, which can be costly in some cases.

You can get a Python list using:

    df["embeddings"] = [arr.tolist() for arr in df["embeddings"]]

That makes a lot of sense! Thanks for the answer :)