Is there any native support in datasets for constructing a Dataset from NumPy arrays, for further use in transformers, without writing the arrays to a file and loading them that way?
As far as I know there isn’t a native Dataset.from_numpy method, but you could map your array to a Python dictionary and use the from_dict method: Loading a Dataset (datasets 1.5.0 documentation)
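A minimal sketch of that workaround, in case it helps. The array contents and the column names features and label are made up for illustration, and to_dataset is just a hypothetical wrapper around Dataset.from_dict:

```python
import numpy as np

# Hypothetical example data: 3 samples with 2 features each, plus labels.
features = np.arange(6, dtype=np.float32).reshape(3, 2)
labels = np.array([0, 1, 0])

# One dict key per column; .tolist() turns the arrays into plain Python lists.
columns = {"features": features.tolist(), "label": labels.tolist()}


def to_dataset(columns):
    """Build an in-memory Dataset from a dict of columns."""
    # Imported here so the NumPy part above runs without datasets installed.
    from datasets import Dataset

    return Dataset.from_dict(columns)
```

Calling to_dataset(columns) should then give a Dataset with one row per sample, with nothing written to disk along the way.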
Native support is always preferable, but I guess making it a dict is a good workaround. Thanks for your reply!!
You can always open an issue on the datasets repo and provide some details on your use case / workflow. My guess is that NumPy arrays are a bit tricky compared to pandas DataFrames because you need some way of encoding the column names for the Arrow format.
Hmmm… I didn’t think about the column names. Maybe automatic inference (like the kind sometimes present in AutoML tools) could be done, where it would suggest what it considers the most appropriate names and allow the user to change them as well? Or a separate argument that takes a list of column names? While all this may just be a less-used feature, it is aimed at a beginner who knows numpy and finds that simple arguments are enough to use their arrays with transformers, instead of making them read the docs for possible workarounds.
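To make the "separate argument" idea concrete, here is a rough sketch. columns_from_array and its column_names parameter are hypothetical and not part of datasets; the fallback simply generates placeholder names rather than doing any real inference:

```python
import numpy as np


def columns_from_array(array, column_names=None):
    """Split a 2-D array into a dict of named 1-D columns (hypothetical helper)."""
    array = np.asarray(array)
    if array.ndim != 2:
        raise ValueError("expected a 2-D array of shape (rows, columns)")
    if column_names is None:
        # Fallback: generate placeholder names that the user can override.
        column_names = [f"col_{i}" for i in range(array.shape[1])]
    if len(column_names) != array.shape[1]:
        raise ValueError("need exactly one name per column")
    return {name: array[:, i] for i, name in enumerate(column_names)}


data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
default_cols = columns_from_array(data)
named_cols = columns_from_array(data, column_names=["height", "weight"])
```

Either dict could then be handed to Dataset.from_dict, so the user only has to supply the array and, optionally, the names.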
I had a quick look at how the from_pandas method works and you can see here that it’s calling InMemoryTable.from_pandas: datasets/arrow_dataset.py at 3fc744abbef13468fa1f42cf4738e5267234549b · huggingface/datasets · GitHub. Interestingly, InMemoryTable has a from_arrays method (link), so it might not be much work to expose this to the Dataset class. I’d definitely open up an issue for this as a feature request!
That would definitely be cool if it can be implemented!! Most people like using NumPy arrays for their simplicity. Integrating them into datasets would be nice.
Feel free to use Dataset.from_dict and pass a dict containing your NumPy arrays.