Hi,
I’m trying to add a new dataset to the repo. It has a field that consists of a dictionary whose key names are not predefined; each value is another dictionary whose keys are predefined.
For instance:
"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
}
As such, another instance may contain different keys (e.g. just "a" and "b").
I’d like to specify this structure in the _info function, but I couldn’t find another dataset that does this to take inspiration from.
Any pointers or code examples would be appreciated!
Hi! Every type is nullable, so you can define the feature types as:
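from datasets import Features, Value

Features({
    "nps": {
        # "string" is the dtype for text; "text" is not a valid Value dtype
        "a": {"id": Value("int32"), "text": Value("string")},
        "b": {"id": Value("int32"), "text": Value("string")},
        "c": {"id": Value("int32"), "text": Value("string")},
    }
})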
and still have samples that contain None for certain fields:
sample1 = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
}}
sample2 = {"nps": {
    "a": None,
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
}}
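When the full set of possible keys is known up front, filling in the missing ones could be sketched as below. This is only an illustration in plain Python: ALL_KEYS and pad_missing are hypothetical names, not part of the datasets library.

```python
# Hypothetical helper; ALL_KEYS and pad_missing are illustrative names only.
ALL_KEYS = ("a", "b", "c")


def pad_missing(sample, keys=ALL_KEYS):
    """Return a copy of the sample where every expected key is present.

    Keys absent from the sample are set to None. Keys that are not listed
    in `keys` are dropped, so this only works for a fixed, known key set.
    """
    nps = sample["nps"]
    return {"nps": {key: nps.get(key) for key in keys}}


# A sample that lacks the "a" entry.
sample = {"nps": {
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
}}
padded = pad_missing(sample)
print(padded)
```

After padding, every sample exposes the same keys, which is what a fixed feature definition expects.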
Also, if the main field name is not known in advance, you can store the field's content under "content" and add an extra "content_name" field that holds the name that is not predefined:
sample1 = {
    "content_name": "nps",
    "content": {
        "a": {"id": 0, "text": "text1"},
        "b": {"id": 1, "text": "text2"},
        "c": {"id": 2, "text": "text3"},
    }
}
Thanks for the answer! But I’m afraid your answer describes a different scenario than mine.
It’s not that the inner dict may differ; it’s the key names that are not consistent across samples.
For instance, I might have the following samples:
sample1 = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
}}
sample2 = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
}}
sample3 = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
    "c": {"id": 2, "text": "text3"},
    "d": {"id": 3, "text": "text4"},
}}
OK, I see. In this case you may have to format your samples this way:
sample1 = {"nps": {
    "keys": ["a", "b"],
    "values": [{"id": 0, "text": "text1"}, {"id": 1, "text": "text2"}],
}}
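A minimal sketch of how the original dictionaries could be converted into this "keys" / "values" layout and back, in plain Python. The helper names to_keys_values and from_keys_values are hypothetical and not part of the datasets library.

```python
def to_keys_values(mapping):
    """Split a dict with arbitrary keys into parallel "keys" and "values" lists."""
    keys = list(mapping)  # preserves the dict's insertion order
    return {"keys": keys, "values": [mapping[key] for key in keys]}


def from_keys_values(packed):
    """Rebuild the original dict from the parallel "keys" and "values" lists."""
    return dict(zip(packed["keys"], packed["values"]))


sample = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "b": {"id": 1, "text": "text2"},
}}

# Convert before yielding the example, and convert back when reading it.
packed = {"nps": to_keys_values(sample["nps"])}
restored = {"nps": from_keys_values(packed["nps"])}
print(packed)
```

In _info, the matching feature could then probably be declared with "keys" as Sequence(Value("string")) and "values" as a list containing {"id": Value("int32"), "text": Value("string")}. That declaration is my assumption and is worth checking against the datasets documentation.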
Do you mean that I need to change the data format?
Following up on our discussion: @lhoestq confirmed that this data structure is indeed not supported, so using it would require changing the original data.
@lhoestq also suggested that adding a JSON feature to the library may solve this issue.
I opened an issue and will follow up here if this is resolved.
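Until such a feature exists, one possible stopgap is to store the whole field as a JSON string in a plain string column and decode it when reading. This is an assumption on my part, not something confirmed in this thread, and encode_nps / decode_nps are hypothetical helper names.

```python
import json


def encode_nps(sample):
    """Return a copy of the sample with the "nps" dict serialized to a JSON string."""
    return {**sample, "nps": json.dumps(sample["nps"], sort_keys=True)}


def decode_nps(row):
    """Return a copy of the row with the "nps" JSON string parsed back into a dict."""
    return {**row, "nps": json.loads(row["nps"])}


sample = {"nps": {
    "a": {"id": 0, "text": "text1"},
    "d": {"id": 3, "text": "text4"},
}}

row = encode_nps(sample)      # "nps" is now a str, so any key names are allowed
restored = decode_nps(row)    # back to the nested dict
print(row["nps"])
```

The trade-off is that the inner structure is opaque to the library, so it cannot be type-checked or queried by column.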
Thanks @lhoestq!