Hello,
I want to report a case that looks weird to me, though I am not sure whether it is a bug.
Let’s assume that we have data like this:
abc = [
    {"t": {"a": 1, "b": 2, "c": 3}},
    {"t": {"aa1": 1, "b1": 2, "c": 3}},
    {"t": {"ab": 1, "b": 2, "c23": 3}},
    {"t": {"a": 1, "b": 2, "c": 3, "d": 4}},
]
When I convert it to a Dataset object, the dict keys from all rows are merged, so every row ends up with every key:
from datasets import Dataset
test = Dataset.from_list(abc)
test
Dataset({
    features: ['t'],
    num_rows: 4
})
test[0]
{'t': {'a': 1,
  'aa1': None,
  'ab': None,
  'b': 2,
  'b1': None,
  'c': 3,
  'c23': None,
  'd': None}}
Why does this happen? Why can’t I keep the data in the format I originally provided?
Hi,
I assume you want a dataset with 3 columns/features: a, b and c.
In that case, you just need a list of dictionaries as seen here: Load. There’s no need for an additional “t” feature, and I’m not sure Datasets supports hierarchical columns: Setting format of columns for nested dictionary datasets with set_format - #2 by lhoestq. cc @lhoestq
Hi Niels,
That is not my case. I want a dataset with a single feature called “t”. Each value of “t” is a dictionary without a fixed set of keys. For example, every record in the “t” feature describes the attributes of a product, and these can differ a lot between products: shoes may have a color, size, and type, while fruit will have entirely different attributes. When I use this kind of data with the datasets library, I run into the problem described above.
abc = [
    {"t": {"name": "shoes", "color": "black", "size": 44}},
    {"t": {"name": "knitting needle", "size": 11, "length": 2, "material": "iron"}},
    {"t": {"name": "book", "language": "English", "isbn": "1231231"}},
    ...
]
I don’t think Datasets supports a variable number of features per row. You will need to include all the features in every row, and use null values in the rows where a feature is missing.
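To illustrate that suggestion, here is a minimal sketch in plain Python (no datasets import needed to run it). The helper names pad_rows and strip_nulls are hypothetical, not part of the Datasets API. The first makes the padding explicit before you build the dataset; the second drops the null placeholders again when you read a row back. It assumes None never appears as a genuine value in your data, since strip_nulls could not tell the two apart. The last line shows one possible alternative, storing each dictionary as a JSON string so the column has a single string type.

```python
import json


def pad_rows(rows, column="t"):
    """Give every row's dict the same keys, filling missing ones with None."""
    all_keys = sorted({key for row in rows for key in row[column]})
    return [
        {column: {key: row[column].get(key) for key in all_keys}}
        for row in rows
    ]


def strip_nulls(row, column="t"):
    """Drop None placeholders to recover the variable-shaped dict.

    Assumption: None is never a real value in the original data.
    """
    return {column: {k: v for k, v in row[column].items() if v is not None}}


abc = [
    {"t": {"name": "shoes", "color": "black", "size": 44}},
    {"t": {"name": "knitting needle", "size": 11, "length": 2, "material": "iron"}},
    {"t": {"name": "book", "language": "English", "isbn": "1231231"}},
]

# Every padded row now has the same set of keys.
padded = pad_rows(abc)

# Reading back: remove the placeholders row by row.
restored = [strip_nulls(row) for row in padded]

# Alternative sketch: keep each dict as a JSON string and decode on access.
as_json = [{"t": json.dumps(row["t"])} for row in abc]
```

Either padded or as_json can be passed to Dataset.from_list. With the JSON variant you would call json.loads on the “t” value when you need the dictionary, at the cost of losing typed access to the individual attributes.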