Representing nested dictionary with different keys

Hi,

I’m trying to add a new dataset to the repo. It has a field consisting of a dictionary whose key names are not predefined, and whose values are dictionaries with predefined keys.
For instance:

"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}

As such, another instance may contain different keys (e.g. just “a”, “b”).

I’d like to specify this structure in the _info function, but I couldn’t find another dataset that does this to take inspiration from.

Any pointers / code examples would be appreciated!

Hi! Every type is nullable, so you can define the feature types as:

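from datasets import Features, Value

# "string" is the dtype name for text values; "text" is not a valid dtype
Features({
    "nps": {
        "a": {"id": Value("int32"), "text": Value("string")},
        "b": {"id": Value("int32"), "text": Value("string")},
        "c": {"id": Value("int32"), "text": Value("string")},
    }
})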

and contain samples that have None for certain fields:

sample1 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}}
sample2 = {"nps": {
  "a": None,
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}}
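If the full set of possible keys is known up front, raw samples can be padded to match a schema like the one above. The following is only a sketch: `ALL_KEYS` and `pad_nps` are hypothetical names introduced for illustration, not part of the library.

```python
# Hypothetical: the complete set of keys declared in the feature schema.
ALL_KEYS = ("a", "b", "c")


def pad_nps(nps, all_keys=ALL_KEYS):
    """Return a copy of `nps` with every schema key present, using None for missing ones."""
    unknown = set(nps) - set(all_keys)
    if unknown:
        # A key outside the schema cannot be represented by this approach.
        raise KeyError(f"keys not declared in the schema: {sorted(unknown)}")
    return {key: nps.get(key) for key in all_keys}


padded = pad_nps({
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
})
# padded["a"] is None, matching sample2 above
```

This only helps when the key set is closed; it does not cover keys that first appear in later samples.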

Also, if the main field name is not known in advance, you can store the field’s content under “content” and add an extra “content_name” field that holds the name that is not predefined:

sample1 = {
  "content_name": "nps",
  "content": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
  }
}

Thanks for the answer! But I’m afraid your answer describes a different scenario than mine.

It’s not that the inner dict may differ; it’s the key names that are not consistent across samples.
For instance, I might have the following samples:

sample1 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
}}
sample2 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
}}
sample3 = {"nps": {
  "a": {"id": 0, "text": "text1"},
  "b": {"id": 1, "text": "text2"},
  "c": {"id": 2, "text": "text3"},
  "d": {"id": 3, "text": "text4"},
}}

Ok I see, in this case you may have to format your samples this way:

sample1 = {"nps": {
  "keys": ["a", "b"],
  "values": [{"id": 0, "text": "text1"}, {"id": 1, "text": "text2"}]
}}
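A minimal sketch of that conversion in plain Python, in case it helps. The helper names `to_keys_values` and `from_keys_values` are made up for this example. On the schema side, "keys" would presumably be declared as a list of strings and "values" as a list of {"id", "text"} dicts, but that part is an assumption and is not shown here.

```python
def to_keys_values(mapping):
    """Split a dict with arbitrary keys into parallel "keys" and "values" lists."""
    keys = list(mapping)  # preserves insertion order
    return {"keys": keys, "values": [mapping[k] for k in keys]}


def from_keys_values(packed):
    """Rebuild the original dict from the parallel lists."""
    return dict(zip(packed["keys"], packed["values"]))


raw = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
}}

# Same shape as sample1 above: every sample now has the same two field names.
sample1 = {"nps": to_keys_values(raw["nps"])}

# The original dict can be recovered after loading.
restored = {"nps": from_keys_values(sample1["nps"])}
```

Because every sample ends up with the same two fields, the number of entries can vary per sample without changing the schema.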

Do you mean that I need to change the data format?

Following up on our discussion, @lhoestq confirmed that this data structure is indeed not supported, and that using it would require a change to the original data.
@lhoestq also suggested that adding a JSON feature to the library may solve this issue.
I opened an issue, and will follow up here in case this is resolved.

Thanks @lhoestq!
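Until something like a JSON feature exists, one possible stopgap is to serialize the variable-key dict to a JSON string and declare the field as a plain string. This is not the proposed feature, just a sketch using the standard `json` module; `encode_nps` and `decode_nps` are hypothetical helper names.

```python
import json


def encode_nps(sample):
    """Return a copy of the sample with "nps" stored as a JSON string."""
    encoded = dict(sample)
    encoded["nps"] = json.dumps(sample["nps"], sort_keys=True)
    return encoded


def decode_nps(sample):
    """Inverse of encode_nps: parse the JSON string back into a dict."""
    decoded = dict(sample)
    decoded["nps"] = json.loads(sample["nps"])
    return decoded


original = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "d": {"id": 3, "text": "text4"},
}}

stored = encode_nps(original)       # "nps" is now a str
round_tripped = decode_nps(stored)  # "nps" is a dict again
```

The trade-off is that the inner "id" and "text" values are no longer typed by the schema, so every consumer has to decode the string itself.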
