Using datasets to open jsonl

bluebingo · June 28, 2025, 6:33pm

Problem When Using Datasets to Open JSONL

I am trying to open a JSONL format file using the datasets library. Here is my code:

from datasets import load_dataset

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

The contents of testdata.jsonl are organized as follows (just for testing):

{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}

When I use the code above to load the dataset and attempt to print the second item, like this:

print(dataset[1])

I get the following output:

{'src': 'hi', 'term': {'a': None, 'b': 'bb'}}

Instead of the expected output:

{'src': 'hi', 'term': {'b': 'bb'}}

How can I obtain the second format of the dataset? Is it possible that I simply forgot to include a parameter?

Pimpcat-AU · June 28, 2025, 10:47pm

Ensure the JSONL file is correctly formatted:
Each line in the file should be a valid JSON object with no extra commas or brackets. For example, the file should look like this:

{"src":"hello","term":{"a":"aa"}}
{"src":"hi","term":{"b":"bb"}}

After fixing the JSONL format, use the following code to load the dataset properly:

from datasets import load_dataset

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

print(dataset[1]) # This should now work correctly

After these changes, the second entry should now print the correct data:

{'src': 'hi', 'term': {'b': 'bb'}}

Also, ensure there are no extra spaces or line breaks in the dataset if it’s large. Each line should be a valid JSON object.

Response generated by Triskel Data Deterministic Ai

John6666 · June 28, 2025, 10:55pm

Another option, albeit a bit rough, is this:

from datasets import load_dataset

def process(example):
    example["term"] = str({k: v for k, v in example["term"].items() if v is not None})
    return example

path = "./testdata.jsonl"
dataset = load_dataset('json', data_files=path, split='train')

print(dataset[1]) # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}

dataset = dataset.map(process)

print(dataset[1]) # {'src': 'hi', 'term': "{'b': 'bb'}"}
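
Since the mapped column now holds the Python repr of the dict, it has to be parsed again whenever the dict itself is needed. A minimal sketch, assuming the strings were produced by str() as in the process function above; restore_term is a hypothetical helper name, and ast.literal_eval comes from the standard library:

```python
import ast

def restore_term(example):
    # Parse the stringified dict back into a real dict.
    # literal_eval only accepts Python literals, so it is safer than eval.
    restored = dict(example)
    restored["term"] = ast.literal_eval(example["term"])
    return restored

row = {"src": "hi", "term": "{'b': 'bb'}"}
print(restore_term(row))  # {'src': 'hi', 'term': {'b': 'bb'}}
```

This is probably best applied to individual rows as they are read rather than through dataset.map, since storing dicts in the column again would likely bring back the None-filled keys.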

bluebingo · June 29, 2025, 6:35pm

Thank you for your advice. I appreciate your efforts, but unfortunately, it hasn’t been effective for me.

bluebingo · June 29, 2025, 6:38pm

Thank you for your advice; it was really helpful in solving the problem! However, I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.

John6666 · June 30, 2025, 1:50am

I find it a bit cumbersome to map the datasets each time I want to open a JSONL file with JSON elements. I wonder if there might be a more permanent solution to address this issue.

That’s true. There may be a more concise method, either existing or planned. I’ll mention it to the library developer, @lhoestq.

bluebingo · June 30, 2025, 8:03am

Thank you! I look forward to any official solutions that the developer might provide.

lhoestq · July 1, 2025, 12:27pm

Hi! This behavior is expected, since datasets uses Arrow, which has fixed types. This means each sample should have the same subfields with the same types. Missing subfields are filled with None.
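
As a rough illustration of the effect (plain Python that only mimics the result, not how Arrow works internally; unify_struct is a hypothetical name), giving every row the union of all subfields reproduces the output from the question:

```python
def unify_struct(rows, field):
    # Collect every key seen under `field`, in first-seen order.
    keys = []
    for row in rows:
        for key in row[field]:
            if key not in keys:
                keys.append(key)
    # Rebuild each row so `field` carries all keys, with None for missing ones.
    return [{**row, field: {k: row[field].get(k) for k in keys}} for row in rows]

rows = [
    {"src": "hello", "term": {"a": "aa"}},
    {"src": "hi", "term": {"b": "bb"}},
]
print(unify_struct(rows, "term")[1])  # {'src': 'hi', 'term': {'a': None, 'b': 'bb'}}
```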

You can restructure your data to fit this paradigm: either convert the nested data to a single string, or use one list for the keys and one list for the values.
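
Both restructurings could look something like the sketch below. It uses only the standard library and assumes records shaped like the ones in the question; the function names and the term_keys / term_values column names are made up for illustration:

```python
import json

def term_as_string(record):
    # Option 1: store the nested dict as a single JSON string.
    return {**record, "term": json.dumps(record["term"])}

def term_as_lists(record):
    # Option 2: store keys and values as two parallel lists.
    rest = {k: v for k, v in record.items() if k != "term"}
    term = record["term"]
    return {**rest, "term_keys": list(term), "term_values": list(term.values())}

record = {"src": "hi", "term": {"b": "bb"}}
print(term_as_string(record))  # {'src': 'hi', 'term': '{"b": "bb"}'}
print(term_as_lists(record))   # {'src': 'hi', 'term_keys': ['b'], 'term_values': ['bb']}

# Reading back: both forms recover the original dict without extra None keys.
restored_from_string = json.loads(term_as_string(record)["term"])
lists = term_as_lists(record)
restored_from_lists = dict(zip(lists["term_keys"], lists["term_values"]))
```

Applying one of these when the JSONL file is written, before calling load_dataset, should give every row the same column types, so nothing needs to be filled with None.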

John6666 · July 1, 2025, 8:18pm

Thank you, lhoestq!

bluebingo · July 2, 2025, 1:16am

Thank you, lhoestq!

system · July 2, 2025, 1:17pm

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.
