Using datasets to open jsonl

Problem When Using Datasets to Open JSONL

I am trying to open a JSONL format file using the datasets library. Here is my code:

from datasets import load_dataset

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

The contents of testdata.jsonl are organized as follows (just for testing):

{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}

When I use the code above to load the dataset and attempt to print the second item, like this:

print(dataset[1])

I get the following output:

{'src': 'hi', 'term': {'a': None, 'b': 'bb'}}

Instead of the expected output:

{'src': 'hi', 'term': {'b': 'bb'}}

How can I obtain the second format of the dataset? Is it possible that I simply forgot to include a parameter?

Ensure the JSONL file is correctly formatted:
Each line in the file should be a valid JSON object with no extra commas or brackets. For example, the file should look like this:

{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}

After fixing the JSONL format, use the following code to load the dataset properly:

from datasets import load_dataset

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

print(dataset[1]) # This should now work correctly

After these changes, the second entry should now print the correct data:

{'src': 'hi', 'term': {'b': 'bb'}}

Also, ensure there are no extra spaces or line breaks in the dataset if it’s large. Each line should be a valid JSON object.

Response generated by Triskel Data Deterministic Ai

Another option, albeit a bit rough, is this:

from datasets import load_dataset

def process(example):
    example["term"] = str({k: v for k, v in example["term"].items() if v is not None})
    return example

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

print(dataset[1]) # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}

dataset = dataset.map(process)

print(dataset[1]) # {'src': 'hi', 'term': "{'b': 'bb'}"}
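
One caveat with the str() call above: it stores a Python repr, which json.loads cannot parse back. Below is a minimal variant of the same process function, assuming a JSON string is acceptable for the term column. It is shown on a plain dict so it runs without the library; the row literal is only an illustration of what the loaded example looks like.

```python
import json

def process(example):
    # Drop the None placeholders that were added for missing subfields.
    term = {k: v for k, v in example["term"].items() if v is not None}
    # Serialize as JSON so the original dict can be recovered later.
    example["term"] = json.dumps(term)
    return example

# Illustrative row, shaped like dataset[1] after loading.
row = {"src": "hi", "term": {"a": None, "b": "bb"}}
converted = process(row)

print(converted)                      # {'src': 'hi', 'term': '{"b": "bb"}'}
print(json.loads(converted["term"]))  # {'b': 'bb'}
```

The function can be passed to dataset.map in the same way as the version above.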

Thank you for your advice. I appreciate your efforts, but unfortunately, it hasn’t been effective for me.

Thank you for your advice; it was really helpful in solving the problem! However, I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.

I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.

That’s true. There may be a more concise method, either one that already exists or one that could be added. I’ll mention it to the library developer. @lhoestq

Thank you! I look forward to any official solutions that the developer might provide.

Hi! This behavior is expected, since datasets uses Arrow, which has fixed types. This means every sample must have the same subfields with the same types. Missing subfields are filled with None.
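
To make the effect concrete, here is a pure-Python imitation of it using the two rows from the question. This is only a sketch of the observable result, not how datasets or Arrow implement it; the names rows, subfields, and unified are made up for the illustration.

```python
# Pure-Python imitation of the effect; not the library's implementation.
rows = [
    {"src": "hello", "term": {"a": "aa"}},
    {"src": "hi", "term": {"b": "bb"}},
]

# A fixed struct type needs one set of subfields shared by every row.
subfields = sorted({key for row in rows for key in row["term"]})

# Each row is padded with None for the subfields it lacks.
unified = [
    {"src": row["src"], "term": {key: row["term"].get(key) for key in subfields}}
    for row in rows
]

print(unified[1])  # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
```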

You can restructure your data to fit this paradigm: either convert the nested data to a single string, or use one list for the keys and one list for the values.
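
Both restructurings can be applied to the JSONL file once, before loading, so no map step is needed afterwards. The following is a minimal sketch using only the standard library. The helper names (term_as_string, term_as_lists, convert_file) and the column names term_keys and term_values are hypothetical, chosen for this example.

```python
import json

def term_as_string(row):
    """Option 1: store the nested dict as a single JSON string."""
    return {**row, "term": json.dumps(row["term"])}

def term_as_lists(row):
    """Option 2: store the keys and values as two parallel lists."""
    rest = {k: v for k, v in row.items() if k != "term"}
    return {
        **rest,
        "term_keys": list(row["term"].keys()),
        "term_values": list(row["term"].values()),
    }

def convert_file(in_path, out_path, convert):
    """Rewrite a JSONL file line by line with the chosen conversion."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(json.dumps(convert(json.loads(line))) + "\n")

row = {"src": "hi", "term": {"b": "bb"}}
print(term_as_string(row))  # {'src': 'hi', 'term': '{"b": "bb"}'}
print(term_as_lists(row))   # {'src': 'hi', 'term_keys': ['b'], 'term_values': ['bb']}
```

For example, convert_file("./testdata.jsonl", "./testdata_flat.jsonl", term_as_string) writes a new file (the output name is arbitrary) that can be loaded with load_dataset as before. With option 1, json.loads(example["term"]) recovers the dict; with option 2, dict(zip(example["term_keys"], example["term_values"])) does.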

Thank you, lhoestq!