Problem When Using Datasets to Open JSONL
I am trying to open a JSONL format file using the datasets
library. Here is my code:
from datasets import load_dataset
path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')
The contents of testdata.jsonl are organized as follows (just for testing):
{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}
When I use the code above to load the dataset and attempt to print the second item, like this:
print(dataset[1])
I get the following output:
{'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
Instead of the expected output:
{'src': 'hi', 'term': {'b': 'bb'}}
How can I obtain the second format of the dataset? Is it possible that I simply forgot to include a parameter?
Ensure the JSONL file is correctly formatted:
Each line in the file should be a valid JSON object with no extra commas or brackets. For example, the file should look like this:
{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}
After fixing the JSONL format, use the following code to load the dataset properly:
from datasets import load_dataset
path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')
print(dataset[1]) # This should now work correctly
After these changes, the second entry should now print the correct data:
{'src': 'hi', 'term': {'b': 'bb'}}
Also, especially if the file is large, make sure it has no blank lines or stray whitespace between records. Each line should be a valid JSON object.
Response generated by Triskel Data Deterministic Ai
Another option, albeit a bit rough, is this:
from datasets import load_dataset
def process(example):
    # Drop the None-valued keys added during loading, then store the dict as a string
    example["term"] = str({k: v for k, v in example["term"].items() if v is not None})
    return example
path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')
print(dataset[1]) # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
dataset = dataset.map(process)
print(dataset[1]) # {'src': 'hi', 'term': "{'b': 'bb'}"}
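One caveat with the snippet above: str() produces a Python repr (single quotes), which json.loads cannot parse back. A possible variant, only a sketch, uses json.dumps so the string can be turned into a dict again when needed. The helper name drop_none is made up for illustration, and the example runs on a plain dict that mimics the loaded row instead of calling load_dataset. The same process function should be usable with dataset.map(process).

```python
import json

def drop_none(term):
    """Return a copy of the dict without the None-valued keys (hypothetical helper)."""
    return {k: v for k, v in term.items() if v is not None}

def process(example):
    # Store the cleaned dict as a JSON string so json.loads can restore it later.
    example["term"] = json.dumps(drop_none(example["term"]))
    return example

# A plain dict standing in for dataset[1] as shown in the thread.
row = {"src": "hi", "term": {"a": None, "b": "bb"}}
encoded = process(dict(row))
print(encoded)                      # {'src': 'hi', 'term': '{"b": "bb"}'}
print(json.loads(encoded["term"]))  # {'b': 'bb'}
```

The column is still a string inside the dataset, so each consumer has to call json.loads on it. That is the price of keeping rows with different keys in one column.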
Thank you for your advice. I appreciate your efforts, but unfortunately, it hasn’t been effective for me.
Thank you for your advice; it was really helpful in solving the problem! However, I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.
Thank you for your assistance! I should mention that this method has already been put forward by @John6666, and I am curious whether there might be a more long-term solution to this issue.
I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.
That’s true. There may be a more concise method, either one that already exists or one that could be added. I’ll mention it to the library developer. @lhoestq
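In the meantime, one way to avoid mapping on every load might be to convert the file once. The extra keys appear because the loader needs a single schema for the term column, so keys from all rows are merged and the missing ones are filled with None. If term is stored as a JSON string in the file itself, there is nothing to merge. The following is only a sketch under that assumption: stringify_column and the output file name testdata_str.jsonl are made up for illustration, and it uses only the standard library.

```python
import json
import os
import tempfile

def stringify_column(src_path, dst_path, column="term"):
    """Copy a JSONL file, replacing the nested `column` with its JSON string form."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if column in record:
                record[column] = json.dumps(record[column])
            dst.write(json.dumps(record) + "\n")

# Recreate the test file from the question in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "testdata.jsonl")
dst_path = os.path.join(workdir, "testdata_str.jsonl")  # hypothetical output name
with open(src_path, "w", encoding="utf-8") as f:
    f.write('{"src":"hello","term":{"a":"aa"}}\n')
    f.write('{"src":"hi","term":{"b":"bb"}}\n')

stringify_column(src_path, dst_path)

# Read the converted file back and decode the second record's term.
with open(dst_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(json.loads(rows[1]["term"]))  # {'b': 'bb'}
```

The converted file can then be opened with the same load_dataset('json', data_files=..., split='train') call as before. The term column should come back as a plain string, so json.loads(dataset[1]["term"]) would give the per-row dict.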