Why are dict objects added to all keys for all records?

Hello,

I want to report a behavior that looks weird to me, although I am not sure whether it is intended.

Let’s assume we have data like this:

abc = [
    {"t": {"a": 1, "b": 2, "c": 3}},
    {"t": {"aa1": 1, "b1": 2, "c": 3}},
    {"t": {"ab": 1, "b": 2, "c23": 3}},
    {"t": {"a": 1, "b": 2, "c": 3, "d": 4}},
]

When I convert it to a Dataset object, every record ends up with the union of the dict keys from all records, and the keys a record did not originally have are set to None.

from datasets import Dataset

test = Dataset.from_list(abc)
test

Dataset({
    features: ['t'],
    num_rows: 4
})
test[0]

{'t': {'a': 1,
  'aa1': None,
  'ab': None,
  'b': 2,
  'b1': None,
  'c': 3,
  'c23': None,
  'd': None}}

Why does this happen? Why can’t I keep each record in the format it was originally given in?

Hi,

I assume you want a dataset with 3 columns/features: a, b and c.

In that case, you just need a list of dictionaries, as shown in the Load guide. There’s no need for an additional “t” feature, and I’m not sure Datasets supports hierarchical columns; see “Setting format of columns for nested dictionary datasets with set_format - #2 by lhoestq”. cc @lhoestq

Hi Niels,

That is not my case. I want a dataset with a single feature called “t”. Each value of “t” is a dictionary without a fixed set of keys. For example, every record in the “t” feature describes the attributes of a product, and these can differ widely: shoes may have a color, a size, and a type, while fruit needs a different set of attributes. When I use this kind of data with the datasets library, I run into the problem described above.

abc = [
    {"t": {"name": "shoes", "color": "black", "size": 44}},
    {"t": {"name": "knitting needle", "size": 11, "length": 2, "material": "iron"}},
    {"t": {"name": "book", "language": "English", "isbn": "1231231"}},
    ...
]

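I don’t think Datasets supports a variable number of features per row. Datasets is backed by Apache Arrow, which stores a column of dictionaries as a struct type with one fixed set of fields, so the keys of all rows get merged. You will need to include all the features in every row and use null values where a row is missing one.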
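As a rough illustration of the reply above, here is a minimal sketch in plain Python (standard library only) of what "add all the features, with nulls for missing values" amounts to. It also shows one possible workaround that was not suggested in this thread: storing each dictionary as a JSON string so the column has a single uniform type. The helper names pad_rows, encode_rows and decode_row are hypothetical, chosen only for this example.

```python
import json

abc = [
    {"t": {"name": "shoes", "color": "black", "size": 44}},
    {"t": {"name": "knitting needle", "size": 11, "length": 2, "material": "iron"}},
    {"t": {"name": "book", "language": "English", "isbn": "1231231"}},
]


def pad_rows(rows, column="t"):
    """Give every dict in `column` the same keys, filling the gaps with None."""
    # Union of all keys that appear in any row.
    all_keys = sorted({key for row in rows for key in row[column]})
    return [{column: {key: row[column].get(key) for key in all_keys}} for row in rows]


def encode_rows(rows, column="t"):
    """Assumed workaround: keep each dict as a JSON string (a plain text column)."""
    return [{column: json.dumps(row[column], sort_keys=True)} for row in rows]


def decode_row(row, column="t"):
    """Turn one JSON-encoded row back into its original dict."""
    return {column: json.loads(row[column])}


padded = pad_rows(abc)        # same keys in every row, None where missing
encoded = encode_rows(abc)    # one string per row, original keys preserved
restored = [decode_row(row) for row in encoded]
```

With the JSON variant, Dataset.from_list(encoded) should produce a string column, and each record can be decoded again on read with json.loads. The trade-off is that the individual attributes are no longer typed columns, so they cannot be filtered or inspected without decoding first.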
