Problem When Using Datasets to Open JSONL
I am trying to open a JSONL format file using the datasets library. Here is my code:
from datasets import load_dataset
path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')
The contents of testdata.jsonl are organized as follows (just for testing):
{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}
When I use the code above to load the dataset and attempt to print the second item, like this:
print(dataset[1])
I get the following output:
{'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
Instead of the expected output:
{'src': 'hi', 'term': {'b': 'bb'}}
How can I obtain the second format of the dataset? Is it possible that I simply forgot to include a parameter?
Ensure the JSONL file is correctly formatted:
Each line in the file should be a valid JSON object with no extra commas or brackets. For example, the file should look like this:
{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}
After fixing the JSONL format, use the following code to load the dataset properly:
from datasets import load_dataset
path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')
print(dataset[1]) # This should now work correctly
After these changes, the second entry should now print the correct data:
{'src': 'hi', 'term': {'b': 'bb'}}
Also, ensure there are no extra spaces or line breaks in the dataset if it’s large. Each line should be a valid JSON object.
Response generated by Triskel Data Deterministic Ai
Another option, albeit a bit rough, is this:
from datasets import load_dataset
def process(example):
    # Drop the None-filled keys, then store the remaining dict as a string
    example["term"] = str({k: v for k, v in example["term"].items() if v is not None})
    return example
path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')
print(dataset[1]) # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
dataset = dataset.map(process)
print(dataset[1]) # {'src': 'hi', 'term': "{'b': 'bb'}"}
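A possible refinement of the map approach above: str() produces a Python repr that is awkward to parse back, whereas json.dumps gives a string that json.loads can restore. This is only a minimal sketch. The helper names pack_term and unpack_term are made up for illustration, and the code runs on a plain dict so it can be tried without loading a dataset.

```python
import json


def pack_term(example):
    """Drop None-valued keys from example["term"] and store it as a JSON string."""
    cleaned = {k: v for k, v in example["term"].items() if v is not None}
    # Build a new dict so the input example is left unchanged.
    return {**example, "term": json.dumps(cleaned, sort_keys=True)}


def unpack_term(example):
    """Parse the JSON string in example["term"] back into a dict."""
    return {**example, "term": json.loads(example["term"])}


# A row shaped like the one the thread reports after loading.
row = {"src": "hi", "term": {"a": None, "b": "bb"}}
packed = pack_term(row)
restored = unpack_term(packed)
print(packed)
print(restored)
```

With a loaded dataset, dataset.map(pack_term) should apply the same function to every row. unpack_term is meant to be called on individual rows as you read them; mapping it over the dataset would presumably store the dicts in a fixed-type column again and bring the None entries back.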
Thank you for your advice. I appreciate your efforts, but unfortunately, it hasn’t been effective for me.
Thank you for your advice; it was really helpful in solving the problem! However, I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.
I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.
That’s true. There may be a more concise method (possibly one that does not exist yet). I’ll mention it to the library developer. @lhoestq
Thank you! I look forward to any official solutions that the developer might provide.
Hi! This behavior is expected, since datasets uses Arrow, which has fixed types. This means each sample should have the same subfields with the same types. Missing subfields are filled with None.
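To make the effect concrete, here is a plain-Python imitation of what happens to the "term" field from the question. It is not how Arrow is implemented, and unify_struct_rows is a hypothetical helper written only for this illustration: it collects every key seen in any row and gives each row all of them, using None where a key was absent.

```python
def unify_struct_rows(rows):
    """Give every dict the same keys, filling gaps with None (illustration only)."""
    all_keys = []
    for row in rows:
        for key in row:
            if key not in all_keys:
                all_keys.append(key)  # keep first-seen order
    # Every row gets every key; dict.get returns None for missing ones.
    return [{key: row.get(key) for key in all_keys} for row in rows]


# The two "term" values from testdata.jsonl.
terms = [{"a": "aa"}, {"b": "bb"}]
unified = unify_struct_rows(terms)
print(unified[1])  # {'a': None, 'b': 'bb'}
```

The printed row matches the 'term' value reported in the question, which is why no load_dataset parameter is missing: the padding comes from the fixed column type.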
You can restructure your data to fit this paradigm: either convert the nested data to a single string, or use one list for keys and one list for values.
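The second suggestion (one list for keys, one list for values) could look roughly like the sketch below. The column names term_keys and term_values and both helper functions are assumptions chosen for this example, not anything the library prescribes. The idea is to rewrite each JSONL record once, before loading, so that every row has the same columns with the same types.

```python
import json


def to_key_value_lists(record):
    """Replace the nested "term" dict with two parallel lists."""
    term = record["term"]
    return {
        "src": record["src"],
        "term_keys": list(term.keys()),
        "term_values": list(term.values()),
    }


def to_term_dict(row):
    """Rebuild the original "term" dict from the two lists."""
    return dict(zip(row["term_keys"], row["term_values"]))


# The two lines of testdata.jsonl from the question.
lines = [
    '{"src":"hello","term":{"a":"aa"}}',
    '{"src":"hi","term":{"b":"bb"}}',
]
converted = [to_key_value_lists(json.loads(line)) for line in lines]
print(json.dumps(converted[1]))
print(to_term_dict(converted[1]))
```

Writing each converted record back with json.dumps, one per line, gives a JSONL file whose rows all share the same schema (a string plus two lists of strings), so it should load without None padding. to_term_dict can then be applied to a row whenever the original dict is needed.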