Union types, or features that can be multiple types

Hi everyone, I’m very new to all this and working with this kind of data, and I don’t know if the data we’ve been using for our own model can be converted to a Hugging Face dataset based on how it’s structured.

If you run this Python snippet:

import pandas as pd
from datasets import Dataset

df = pd.DataFrame({'a': [1, 2, 3, 4, 'v', 6, 7]})
dataset = Dataset.from_pandas(df)

you get an error because the values in the dataframe column are not all the same type: there is one string ('v') mixed in with the ints.

I’m working with data composed of tables from spreadsheets, where each column is represented as a JSON object of the form {"Name": <string>, "Values": <array of string and numbers>}, where Name is the header of the table column (in one cell) and Values represents the column of cells below it. The whole table is stored as an array of these.

In the dataset I’m trying to create, this table would be one feature among several (along with other things such as the ID and a descriptive string). But can I create a feature for it when every feature has to declare a single type (even Array2D needs one), while one of these objects can hold both strings and numbers in its Values list?

Even more complicated, there is another feature we use, which can be either a number, a string, one of these tables (array of those column objects), or an array of tables (so an array of arrays of column objects). Is it possible to have that be a feature when it has so many possibilities for its type? A union type would be really nice, but I don’t think it’s possible, or at least I couldn’t find it in the documentation.

Thank you!

Hi! The union type is currently not supported in datasets, but it is supported in PyArrow, the underlying format datasets relies on, so feel free to open a feature request in our GitHub repo. In the meantime, you can use structs with a separate field for each type.
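
Here is a minimal sketch of that struct workaround for the table feature described in the question. The helper names (tag_cell, tag_table, build_dataset), the struct field names (number, text), and the sample table are all hypothetical, chosen only for illustration. It also assumes numeric cells can be stored as float64 and that your datasets version provides Dataset.from_list.

```python
def tag_cell(cell):
    """Wrap one cell in a struct with one field per possible type.

    Only the field matching the cell's type is filled; the other stays None.
    """
    if isinstance(cell, str):
        return {"number": None, "text": cell}
    if cell is None:
        return {"number": None, "text": None}
    # Assumption: every non-string cell is numeric and fits in a float.
    return {"number": float(cell), "text": None}


def tag_table(table):
    """Convert a list of {"Name": ..., "Values": [...]} columns to tagged form."""
    return [
        {"Name": col["Name"], "Values": [tag_cell(v) for v in col["Values"]]}
        for col in table
    ]


def build_dataset(rows):
    """Build a Dataset with an explicit schema for the tagged tables."""
    # Imported here so the helpers above can be used without `datasets` installed.
    from datasets import Dataset, Features, Value

    features = Features({
        "id": Value("string"),
        "description": Value("string"),
        # A table is a list of columns; each column holds a list of tagged cells.
        "table": [{
            "Name": Value("string"),
            "Values": [{"number": Value("float64"), "text": Value("string")}],
        }],
    })
    return Dataset.from_list(rows, features=features)


# Hypothetical sample data in the layout described in the question.
table = [
    {"Name": "Item", "Values": ["apple", "pear"]},
    {"Name": "Count", "Values": [3, "n/a"]},
]
rows = [{
    "id": "sheet-1",
    "description": "hypothetical example table",
    "table": tag_table(table),
}]
```

Calling build_dataset(rows) should then give a dataset whose table column has one fixed type. To read a cell back, take whichever of number or text is not None. The feature that can be a number, a string, a table, or a list of tables could follow the same pattern, with one struct field per alternative and a single table stored as a one-element list.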
