Representing nested dictionary with different keys

yanaiela · April 4, 2022, 8:11am

Hi,

I’m trying to add a new dataset to the repo. It has a field that consists of a dictionary whose key names are not predefined, and each value is another dictionary whose keys are predefined.
For instance:

"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}

As such, another instance may contain different keys (e.g. just “a”, “b”).

I’d like to specify this structure in the _info function, but I couldn’t find another dataset that does this to take inspiration from.

Any pointers / code example would be appreciated!

lhoestq · April 7, 2022, 9:55am

Hi! Every type is nullable, so you can define the feature types as

from datasets import Features, Value

# "string" is the dtype name for text fields
Features({
    "nps": {
        "a": {"id": Value("int32"), "text": Value("string")},
        "b": {"id": Value("int32"), "text": Value("string")},
        "c": {"id": Value("int32"), "text": Value("string")},
    }
})

and contain samples that have None for certain fields:

sample1 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}}
sample2 = {"nps": {
  "a": None,
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}}
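If the full set of possible keys is known up front, padding each sample to that set could look roughly like the sketch below. The helper name `pad_missing_keys` and the `ALL_KEYS` list are hypothetical, introduced only for illustration. This is plain Python and does not touch the library.

```python
# Hypothetical list of every key that may appear; assumed to be known in advance.
ALL_KEYS = ["a", "b", "c"]


def pad_missing_keys(nps, all_keys):
    """Return a copy of `nps` with every key in `all_keys`; absent keys map to None."""
    return {key: nps.get(key) for key in all_keys}


# A sample that lacks "a" gets an explicit None for it.
partial = {
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
}
sample2 = {"nps": pad_missing_keys(partial, ALL_KEYS)}
```

This only works when the key set is fixed and reasonably small, since every sample then carries every key.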

Also, if the main field name is not known in advance, you may name the content of the field “content” and have an extra field “content_name” that contains the name that is not predefined.

sample1 = {
  "content_name": "nps",
  "content": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
  }
}

yanaiela · April 7, 2022, 10:15am

Thanks for the answer! But I’m afraid your answer describes a different scenario than mine.

It’s not that the inner dict may differ, it’s the key names that are not consistent across samples.
For instance, I might have the following samples:

sample1 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
}}
sample2 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}}
sample3 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
  "d": {"id": 3, "text": "text4"},
}}

lhoestq · April 7, 2022, 10:26am

Ok I see, in this case you may have to format your samples this way:

sample1 = {"nps": {
  "keys": ["a", "b"],
  "values": [{"id": 0, "text": "text1"}, {"id": 1, "text": "text2"}]
}}
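A minimal sketch of converting between the original dictionary and this keys/values layout, in plain Python. The helper names `to_keys_values` and `from_keys_values` are hypothetical and not part of the library.

```python
from typing import Any


def to_keys_values(mapping: dict[str, Any]) -> dict[str, list]:
    """Split a dict with arbitrary keys into two parallel lists."""
    keys = list(mapping)
    return {"keys": keys, "values": [mapping[k] for k in keys]}


def from_keys_values(packed: dict[str, list]) -> dict[str, Any]:
    """Rebuild the original dict from the two parallel lists."""
    return dict(zip(packed["keys"], packed["values"]))


raw = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
}}

# Reformat before yielding the example from the dataset script.
sample1 = {"nps": to_keys_values(raw["nps"])}

# Recover the original mapping after loading.
restored = {"nps": from_keys_values(sample1["nps"])}
```

With this layout every sample has the same two fields, so the schema no longer depends on which key names a given sample contains; the cost is that consumers have to zip the two lists back together.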

yanaiela · April 7, 2022, 10:33am

Do you mean that I need to change the data format?

yanaiela · April 7, 2022, 11:09am

Following up on our discussion, @lhoestq confirmed that this data structure is indeed not supported, so using it would require changing the original data.
@lhoestq also suggested that adding a JSON feature to the library might solve this issue.
I opened an issue and will follow up here if it is resolved.

Thanks @lhoestq!
