Compatibility for numpy arrays

Is there any native support in datasets for constructing a Dataset from NumPy arrays, for use with transformers, without first writing the arrays to a file and loading them back?

As far as I know there isn’t a native Dataset.from_numpy method, but you could map your array to a Python dictionary and use the from_dict method: Loading a Dataset — datasets 1.5.0 documentation

Native support is always preferable, but I guess making it a dict is a good workaround :wink:
Thanks for your reply!!

you can always open an issue on the datasets repo and provide some details on your use case / workflow :slight_smile:

my guess is that numpy arrays are a bit tricky compared to pandas dataframes because you need some way of supplying the column names for the arrow format

Hmmm…I didn’t think about the column names :thinking: Maybe the names could be inferred automatically (as some AutoML tools do): the library would suggest the names it finds most appropriate and let the user change them?

A separate argument that takes a list containing column names?
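
Something like this could work in user code today. It is only a sketch: `array_to_columns` is a hypothetical helper, not part of datasets, and it assumes a 2-D array with one column per name:

```python
import numpy as np

def array_to_columns(array, column_names):
    """Split a 2-D array into a {name: 1-D column} dict (hypothetical helper)."""
    array = np.asarray(array)
    if array.ndim != 2:
        raise ValueError("expected a 2-D array of shape (rows, columns)")
    if array.shape[1] != len(column_names):
        raise ValueError("need exactly one name per column")
    # Slice out each column and pair it with its name
    return {name: array[:, i] for i, name in enumerate(column_names)}

data = np.arange(6).reshape(3, 2)
columns = array_to_columns(data, ["feature", "label"])
print(columns)
```

The resulting dict has the shape that Dataset.from_dict expects, so it can be passed straight to it.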

While these may be rarely used features, they would mostly help a beginner who knows NumPy: a few simple arguments would be enough to use their arrays with transformers, instead of having to read the docs for possible workarounds :wink:

i had a quick look at how the from_pandas method works and you can see here that it’s calling InMemoryTable.from_pandas: datasets/ at 3fc744abbef13468fa1f42cf4738e5267234549b · huggingface/datasets · GitHub

interestingly, InMemoryTable has a from_arrays method (link) so it might not be much work to expose this to the Dataset class - i’d definitely open up an issue for this as a feature request!

that would definitely be cool if it can be implemented!! most people like using NumPy arrays for their simplicity. Integrating them into datasets would be nice :partying_face:

Feel free to use Dataset.from_dict and pass a dict containing your numpy arrays :wink: