Using datasets to open jsonl

Problem When Using Datasets to Open JSONL

I am trying to open a JSONL format file using the datasets library. Here is my code:

from datasets import load_dataset

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

The contents of testdata.jsonl are organized as follows (just for testing):

{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}

When I use the code above to load the dataset and attempt to print the second item, like this:

print(dataset[1])

I get the following output:

{'src': 'hi', 'term': {'a': None, 'b': 'bb'}}

Instead of the expected output:

{'src': 'hi', 'term': {'b': 'bb'}}

How can I obtain the second format of the dataset? Is it possible that I simply forgot to include a parameter?
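For context, the extra 'a': None most likely comes from how the library stores nested dicts: as far as I understand, the column is backed by Arrow, where a struct column has one fixed set of fields for every row, so keys missing from a row are filled with None. The snippet below is only a rough pure-Python model of that behaviour, not the library's actual implementation; the names term_keys and unified are made up for illustration.

```python
# Rough model (an assumption, not the datasets implementation): every row of a
# struct-like column ends up with the union of all keys seen in that column.
rows = [
    {"src": "hello", "term": {"a": "aa"}},
    {"src": "hi", "term": {"b": "bb"}},
]

# Collect every key that appears in any "term" dict.
term_keys = sorted({key for row in rows for key in row["term"]})

# Rebuild each row so that "term" has all keys; missing ones become None.
unified = [
    {"src": row["src"], "term": {key: row["term"].get(key) for key in term_keys}}
    for row in rows
]

print(unified[1])  # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
```

If this model is right, no loading parameter is missing; the None values are a consequence of the fixed schema.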


Ensure the JSONL file is correctly formatted:
Each line in the file should be a valid JSON object with no extra commas or brackets. For example, the file should look like this:

{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}

After fixing the JSONL format, use the following code to load the dataset properly:

from datasets import load_dataset

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

print(dataset[1]) # This should now work correctly

After these changes, the second entry should now print the correct data:

{'src': 'hi', 'term': {'b': 'bb'}}

Also, ensure there are no extra spaces or line breaks in the dataset if it’s large. Each line should be a valid JSON object.

Response generated by Triskel Data Deterministic Ai


Another option, albeit a bit rough, is this:

from datasets import load_dataset

def process(example):
    example["term"] = str({k: v for k, v in example["term"].items() if v is not None})
    return example

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

print(dataset[1]) # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}

dataset = dataset.map(process)

print(dataset[1]) # {'src': 'hi', 'term': "{'b': 'bb'}"}
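A possible variation on the same idea, as a sketch only: serialize the cleaned dict with json.dumps instead of str, so the string can be turned back into a dict later with json.loads. The function name drop_none_terms is hypothetical; it is meant to be passed to dataset.map the same way as process above, but the example here runs on a plain dict so it does not need the library.

```python
import json

def drop_none_terms(example):
    # Remove the None-valued keys that were added when the rows were unified.
    cleaned = {k: v for k, v in example["term"].items() if v is not None}
    # Store the result as a JSON string so it can be parsed back later.
    example["term"] = json.dumps(cleaned)
    return example

# A row shaped like dataset[1] from the question.
row = {"src": "hi", "term": {"a": None, "b": "bb"}}
row = drop_none_terms(row)

print(row["term"])              # {"b": "bb"}
print(json.loads(row["term"]))  # {'b': 'bb'}
```

The column is still a string in the dataset, so json.loads is needed each time the dict form is wanted.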

Thank you for your advice. I appreciate your efforts, but unfortunately, it hasn’t been effective for me.


Thank you for your advice; it was really helpful in solving the problem! However, I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.



Thank you for your assistance! I should mention that this method was already suggested by @John6666, and I am curious whether there might be a more long-term solution to this issue.


I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.

That’s true. There may be a more concise method, either one that already exists or one that could be added to the library. I’ll mention it to the library developer. @lhoestq
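In the meantime, if the rows only need to be read as plain Python dicts and the Dataset features are not required, one option is to bypass the library and parse the file with the standard json module. This is a minimal sketch under that assumption; read_jsonl is a hypothetical helper, and the temporary file just recreates testdata.jsonl from the question.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one plain dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Recreate the two-line test file from the question in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "testdata.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"src":"hello","term":{"a":"aa"}}\n')
        f.write('{"src":"hi","term":{"b":"bb"}}\n')
    records = list(read_jsonl(path))

print(records[1])  # {'src': 'hi', 'term': {'b': 'bb'}}
```

Each record keeps exactly the keys that were on its own line, because nothing forces the rows into a shared schema.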