Why are dict objects added to all keys for all records?

yusufcakmak · May 3, 2024, 12:41pm

Hello,

I want to report a case that looks weird to me, though I am not sure it is a bug.

Let’s assume we have data like this:

abc = [
    {"t": {"a": 1, "b": 2, "c": 3}},
    {"t": {"aa1": 1, "b1": 2, "c": 3}},
    {"t": {"ab": 1, "b": 2, "c23": 3}},
    {"t": {"a": 1, "b": 2, "c": 3, "d": 4}},
]

When I convert it to a Dataset object, the keys of all the dicts are merged, so every record ends up with every key:

from datasets import Dataset

test = Dataset.from_list(abc)
test

Dataset({
    features: ['t'],
    num_rows: 4
})

test[0]

{'t': {'a': 1,
  'aa1': None,
  'ab': None,
  'b': 2,
  'b1': None,
  'c': 3,
  'c23': None,
  'd': None}}

Why does this happen? Why can’t I keep the data in its original format?

nielsr · May 6, 2024, 10:12am

Hi,

I assume you want a dataset with 3 columns/features: a, b and c.

In that case, you just need a list of dictionaries as seen here: Load. There’s no need for an additional “t” feature, and I’m not sure Datasets supports hierarchical columns; see the thread “Setting format of columns for nested dictionary datasets with set_format” (reply #2 by lhoestq). cc @lhoestq

yusufcakmak · May 6, 2024, 11:51am

Hi Niels,

That is not my case. I want a dataset with a single feature called “t”, whose values are dictionaries with no fixed set of keys. For example, each record in “t” describes the attributes of a product, and those attributes can differ widely: shoes may have color, size, and type, while fruit has entirely different attributes. When I use this kind of data with the datasets library, I run into the problem described above.

abc = [
    {"t": {"name": "shoes", "color": "black", "size": 44}},
    {"t": {"name":"knitting needle", "size": 11, "length": 2, "material": "iron"}},
    {"t": {"name": "book", "language": "English", "isbn": "1231231"}},
    ...
]

nielsr · May 6, 2024, 12:45pm

I don’t think Datasets supports a variable number of features per row. You will need to add all the features, and add null values for rows which have a missing value.
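
To make the two options concrete, here is a minimal sketch in plain Python, not an official recipe. The helper names pad_records and encode_records are hypothetical and only for illustration. The first helper does by hand what the output earlier in the thread shows: every dict gets the union of all keys, with None where a row has no value. The second assumes it is acceptable to store each dict as a JSON string, which keeps every row's own keys intact at the cost of decoding on read.

```python
import json

# Sample records copied from the example earlier in the thread.
records = [
    {"t": {"name": "shoes", "color": "black", "size": 44}},
    {"t": {"name": "knitting needle", "size": 11, "length": 2, "material": "iron"}},
    {"t": {"name": "book", "language": "English", "isbn": "1231231"}},
]


def pad_records(rows, column="t"):
    """Hypothetical helper: give every dict under `column` the same keys, filling gaps with None."""
    all_keys = sorted({key for row in rows for key in row[column]})
    return [{column: {key: row[column].get(key) for key in all_keys}} for row in rows]


def encode_records(rows, column="t"):
    """Hypothetical helper: store each dict under `column` as a JSON string so keys are not merged."""
    return [{column: json.dumps(row[column], sort_keys=True)} for row in rows]


padded = pad_records(records)
encoded = encode_records(records)

# Decoding a row restores exactly the keys that row started with.
restored = json.loads(encoded[0]["t"])
```

Either list can then be passed to Dataset.from_list. With the encoded version, “t” becomes a plain string column, so each row has to be decoded with json.loads when it is read back.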
