Compatibility for numpy arrays

Neel-Gupta · April 7, 2021, 4:18pm

Does datasets have any native support for constructing a Dataset directly from NumPy arrays, so it can be used with transformers without first writing the arrays to a file and loading them back?

lewtun · April 8, 2021, 12:12pm

As far as I know there isn’t a native Dataset.from_numpy method, but you could map your array to a Python dictionary and use the from_dict method (see "Loading a Dataset" in the datasets 1.5.0 documentation).

Neel-Gupta · April 8, 2021, 1:10pm

Native support is always preferable, but I guess making it a dict is a good workaround.
Thanks for your reply!!

lewtun · April 8, 2021, 1:17pm

you can always open an issue on the datasets repo and provide some details on your use case / workflow.

my guess is that numpy arrays are a bit tricky compared to pandas dataframes because you need some way of encoding the column names for the arrow format.

Neel-Gupta · April 8, 2021, 2:19pm

Hmmm… I didn’t think about the column names. Maybe automatic inference (like some AutoML tools offer) could be done, where it suggests what it thinks are the most appropriate names and lets the user change them?

A separate argument that takes a list containing column names?

While these may be less-used features, they would mostly help beginners who know NumPy: a few simple arguments would be enough to use their arrays with transformers, instead of having to read the docs for possible workarounds.

lewtun · April 8, 2021, 3:35pm

i had a quick look at how the from_pandas method works and you can see that it’s calling InMemoryTable.from_pandas (datasets/arrow_dataset.py at commit 3fc744abbef13468fa1f42cf4738e5267234549b in huggingface/datasets).

interestingly, InMemoryTable has a from_arrays method, so it might not be much work to expose this to the Dataset class - i’d definitely open up an issue for this as a feature request!

Neel-Gupta · April 8, 2021, 4:28pm

that would definitely be cool if it can be implemented!! most people like using NumPy arrays for their simplicity. Integrating them into datasets would be nice.

lhoestq · April 27, 2021, 5:48pm

Feel free to use Dataset.from_dict and pass a dict containing your numpy arrays
