Avoid standardizing keys for feature values which are a list of dictionaries

Hi, I’m trying to create a HF dataset from a list using Dataset.from_list.

Each sample in the list is a dict with the same keys (which will be my features). The values for each feature are a list of dictionaries, and each such dictionary has a different set of keys. However, the datasets library standardizes all dictionaries under a feature and adds all possible keys (with None value) from all the dictionaries under that feature.

How can I keep the same set of keys as in the original list for each dictionary under a feature?

Here’s a simple example:

from datasets import Dataset

# Define a function to generate a sample with a "feature_1" feature
def generate_sample():
    # Generate random sample data
    sample_data = {
        "text": "Sample text",
        "feature_1": []
    }
    
    # Add feature_1, whose dictionaries do not share the same keys
    feature_1 = [{"key1": "value1"}, {"key2": "value2"}]  # Each dict has a different key
    sample_data["feature_1"].extend(feature_1)
    
    return sample_data

# Generate multiple samples
num_samples = 10
samples = [generate_sample() for _ in range(num_samples)]

# Create a Hugging Face Dataset
dataset = Dataset.from_list(samples)
dataset[0]

The output is:

{'text': 'Sample text', 'feature_1': [{'key1': 'value1', 'key2': None}, {'key1': None, 'key2': 'value2'}]}

Instead, I want to construct the dataset so that I get this:
{'text': 'Sample text', 'feature_1': [{'key1': 'value1'}, {'key2': 'value2'}]}


I have the same question. I also need to keep the keys of the dicts in each sample unchanged.

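Hi! Since datasets uses the columnar Arrow format under the hood, all the samples in the dataset are required to have the same fields and (nullable) types. That is why every dictionary under a feature ends up with the union of all keys, with None for the keys it did not originally have.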
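To make the effect concrete, here is a rough pure-Python imitation of what happens to the dictionaries. This is only an illustration and not the library's actual code: unify_struct_keys is a hypothetical helper, and it only looks at a single list, whereas the real schema is shared by the whole column.

```python
def unify_struct_keys(dicts):
    """Give every dict the union of all keys, filling missing ones with None.

    Hypothetical helper for illustration only; it mimics the visible result
    of a fixed struct schema, not how datasets/Arrow implement it.
    """
    all_keys = []
    for d in dicts:
        for key in d:
            if key not in all_keys:  # preserve first-seen key order
                all_keys.append(key)
    return [{key: d.get(key) for key in all_keys} for d in dicts]


feature_1 = [{"key1": "value1"}, {"key2": "value2"}]
unified = unify_struct_keys(feature_1)
print(unified)
```

The printed result has the same shape as the dataset[0]["feature_1"] output in the question: each dict carries both key1 and key2, with None where the key was absent.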

A workaround could be to store the keys and values in separate lists, so that every item has the same two fields. That is, convert

[{'key1': 'value1'}, {'key2': 'value2'}]

to

[{"keys": ["key1"], "values": ["value1"]}, {"keys": ["key2"], "values": ["value2"]}]

which can be reconstructed using

sample_data["feature_1"] = [
    dict(zip(item["keys"], item["values"]))
    for item in feature_1
]
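Putting both directions together, a minimal sketch of the round trip could look like this. The names encode_dicts and decode_dicts are made up for this example, and it assumes all values have the same type (strings here), since the "values" list has to be homogeneous once it is stored in Arrow.

```python
def encode_dicts(dicts):
    """Turn each dict into a {"keys": [...], "values": [...]} item.

    Hypothetical helper: every item then has the same two fields,
    regardless of which keys the original dict had.
    """
    return [
        {"keys": list(d.keys()), "values": list(d.values())}
        for d in dicts
    ]


def decode_dicts(items):
    """Rebuild the original dicts from the keys/values items."""
    return [dict(zip(item["keys"], item["values"])) for item in items]


original = [{"key1": "value1"}, {"key2": "value2"}]
encoded = encode_dicts(original)
decoded = decode_dicts(encoded)

print(encoded)
print(decoded)
```

You would apply encode_dicts to feature_1 in each sample before calling Dataset.from_list, and decode_dicts on feature_1 of each row when reading it back.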