Ensure the JSONL file is correctly formatted:
Each line in the file should be a valid JSON object with no extra commas or brackets. For example, the file should look like this:
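As a rough illustration, here is a hypothetical two-line sample (not the original file from this thread) together with a small stdlib-only check that every line parses as a standalone JSON object. The helper name `find_invalid_lines` and the sample records are assumptions made for this sketch.

```python
import json

# Hypothetical sample records, one JSON object per line (not the original file).
sample_lines = [
    '{"src": "hello", "term": {"a": "aa"}}',
    '{"src": "hi", "term": {"b": "bb"}}',
]


def find_invalid_lines(lines):
    """Return (line_number, reason) pairs for lines that are not JSON objects."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are skipped in this sketch
        try:
            record = json.loads(text)
        except json.JSONDecodeError as err:
            # Trailing commas, stray brackets, etc. end up here.
            problems.append((number, err.msg))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
    return problems


print(find_invalid_lines(sample_lines))
```

To check a real file, the same function could be called on an open file handle, for example `find_invalid_lines(open(path, encoding="utf-8"))`.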
from datasets import load_dataset


def process(example):
    # Workaround: drop the None-filled subfields and store "term" as a string.
    example["term"] = str({k: v for k, v in example["term"].items() if v is not None})
    return example


path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')
print(dataset[1])  # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
dataset = dataset.map(process)
print(dataset[1])  # {'src': 'hi', 'term': "{'b': 'bb'}"}
Thank you for your advice; it was really helpful in solving the problem! However, I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.
That’s true. There may be a more concise method, either one that already exists or one that could be added. I’ll mention it to the library developer. @lhoestq
Hi! This behavior is expected, since datasets uses Arrow, which has fixed types. This means each sample must have the same subfields with the same types. Missing subfields are filled with None.
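To make the effect concrete, here is a plain-Python sketch that mimics what the reader sees after loading. It is only an illustration of the outcome, not how Arrow or datasets implement it, and the helper `fill_missing_subfields` and the sample rows are hypothetical.

```python
def fill_missing_subfields(records, field):
    """Give every record the same subfields under `field`, filling gaps with None."""
    # Collect the union of subfield names, in first-seen order.
    keys = []
    for record in records:
        for key in record[field]:
            if key not in keys:
                keys.append(key)
    # Rebuild each record so that every subfield is present.
    return [
        {**record, field: {key: record[field].get(key) for key in keys}}
        for record in records
    ]


# Hypothetical rows whose "term" subfields differ from line to line.
rows = [
    {"src": "hello", "term": {"a": "aa"}},
    {"src": "hi", "term": {"b": "bb"}},
]
filled = fill_missing_subfields(rows, "term")
print(filled[1])  # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
```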
You can restructure your data to fit this paradigm: either convert the nested data to a single string, or use one list for keys and one list for values.
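A minimal sketch of both options, assuming records shaped like the ones in this thread (a `src` string and a nested `term` dict). The function names and the `term_keys` / `term_values` column names are hypothetical, chosen only for illustration.

```python
import json


def term_as_json_string(record):
    """Option 1: serialize the nested dict so the column is a plain string."""
    return {**record, "term": json.dumps(record["term"], sort_keys=True)}


def term_as_key_value_lists(record):
    """Option 2: replace the nested dict with two parallel lists."""
    term = record["term"]
    rest = {k: v for k, v in record.items() if k != "term"}
    return {**rest, "term_keys": list(term.keys()), "term_values": list(term.values())}


record = {"src": "hi", "term": {"b": "bb"}}

as_string = term_as_json_string(record)
print(as_string)  # {'src': 'hi', 'term': '{"b": "bb"}'}
print(json.loads(as_string["term"]))  # {'b': 'bb'}

as_lists = term_as_key_value_lists(record)
print(as_lists)  # {'src': 'hi', 'term_keys': ['b'], 'term_values': ['bb']}
```

If the records are rewritten this way once, before the JSONL file is saved, every row has the same columns and types, so no extra map step should be needed on each load. The list variant assumes all values share one type (strings here).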