Hi, I’m trying to create a HF dataset from a list using Dataset.from_list.
Each sample in the list is a dict with the same keys (which will be my features). The values for each feature are a list of dictionaries, and each such dictionary has a different set of keys. However, the datasets library standardizes all dictionaries under a feature and adds all possible keys (with None value) from all the dictionaries under that feature.
How can I keep the same set of keys as in the original list for each dictionary under a feature?
Here’s a simple example:
from datasets import Dataset

# Generate one sample whose "feature_1" holds dicts with different keys
def generate_sample():
    sample_data = {
        "text": "Sample text",
        "feature_1": [],
    }
    # Each dict under feature_1 has its own set of keys
    feature_1 = [{"key1": "value1"}, {"key2": "value2"}]
    sample_data["feature_1"].extend(feature_1)
    return sample_data

# Generate multiple samples
num_samples = 10
samples = [generate_sample() for _ in range(num_samples)]

# Create a Hugging Face Dataset
dataset = Dataset.from_list(samples)
print(dataset[0])
The output is:
{'text': 'Sample text', 'feature_1': [{'key1': 'value1', 'key2': None}, {'key1': None, 'key2': 'value2'}]}
Instead, I want to construct the dataset so that I get this:
{'text': 'Sample text', 'feature_1': [{'key1': 'value1'}, {'key2': 'value2'}]}
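For reference, the only workaround I can think of is to post-process each row after reading it. Below is a minimal sketch of that idea, assuming that no key legitimately holds None in the original data (otherwise it would be dropped too). The helper name drop_none_keys is my own and is not part of the datasets library.

```python
from typing import Any, Dict


def drop_none_keys(sample: Dict[str, Any], feature: str = "feature_1") -> Dict[str, Any]:
    """Return a copy of `sample` with None-valued keys removed from each dict under `feature`."""
    cleaned = dict(sample)  # shallow copy so the input row is left untouched
    cleaned[feature] = [
        {key: value for key, value in item.items() if value is not None}
        for item in sample[feature]
    ]
    return cleaned


# Same shape as the dataset[0] output shown above
row = {
    "text": "Sample text",
    "feature_1": [
        {"key1": "value1", "key2": None},
        {"key1": None, "key2": "value2"},
    ],
}
restored = drop_none_keys(row)
```

This only cleans rows on the way out; the stored dataset would still contain the padded keys, which is why I would prefer a way to keep the original keys at construction time.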