TypeError: Couldn't cast array of type timestamp[s] to null
from datasets import load_dataset
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
issues_dataset
Downloading data files: 100%
1/1 [00:00<00:00, 37.80it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 44.29it/s]
Generating train split:
2584/0 [00:01<00:00, 3104.08 examples/s]
TypeError: Couldn't cast array of type timestamp[s] to null
The above exception was the direct cause of the following exception:
DatasetGenerationError: An error occurred while generating the dataset
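A plausible cause, stated here as an assumption rather than something the traceback confirms: the JSON loader infers column types from the first chunk of the file, so a field that is null in every early row is typed as null, and a timestamp that shows up later cannot be cast to it. The sketch below uses only the standard library to list fields that are null throughout the first rows and become non-null later. The function names and the `head_rows` cutoff are hypothetical; the real loader works in chunks of bytes, not a fixed row count.

```python
import json


def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs, descending into nested dicts."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=f"{path}.")
        else:
            yield path, value


def late_non_null_fields(jsonl_path, head_rows=1000):
    """Map each field that is null in the first `head_rows` lines but non-null
    later to the 1-based line number of its first non-null value."""
    seen_early = set()  # fields with a non-null value inside the head
    first_late = {}     # field -> line of first non-null value after the head
    with open(jsonl_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            for path, value in flatten(json.loads(line)):
                if value is None:
                    continue
                if line_no <= head_rows:
                    seen_early.add(path)
                elif path not in seen_early and path not in first_late:
                    first_late[path] = line_no
    return first_late
```

Calling `late_non_null_fields("datasets-issues.jsonl")` would then point at the columns (for example a date field that is empty for the earliest issues) worth inspecting first.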
I split datasets-issues.jsonl into two files and found that each file loads correctly on its own:
from datasets import load_dataset
issues_dataset_1 = load_dataset("json", data_files="datasets-issues-1.jsonl", split="train")
issues_dataset_1
Dataset({
features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason'],
num_rows: 2884
})
from datasets import load_dataset
issues_dataset_2 = load_dataset("json", data_files="datasets-issues-2.jsonl", split="train")
issues_dataset_2
Dataset({
features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason'],
num_rows: 3624
})
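For reference, a split like the one above can be sketched with the standard library alone. The helper name `split_jsonl` and the `split_at` parameter are hypothetical, and the sketch assumes the file is cut at a line boundary, with the first `split_at` lines going to the first file and the rest to the second.

```python
def split_jsonl(src_path, first_path, second_path, split_at):
    """Copy the first `split_at` lines of `src_path` to `first_path` and the
    remaining lines to `second_path`. Returns (lines_in_first, lines_in_second)."""
    counts = [0, 0]
    with open(src_path, encoding="utf-8") as src, \
         open(first_path, "w", encoding="utf-8") as first, \
         open(second_path, "w", encoding="utf-8") as second:
        for index, line in enumerate(src):
            # Lines before the cutoff go to the first file, the rest to the second.
            target = 0 if index < split_at else 1
            (first, second)[target].write(line)
            counts[target] += 1
    return tuple(counts)


# Hypothetical usage, with the cutoff taken from the first part's row count:
# split_jsonl("datasets-issues.jsonl", "datasets-issues-1.jsonl",
#             "datasets-issues-2.jsonl", split_at=2884)
```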
I tried to combine issues_dataset_1 and issues_dataset_2 into issues_dataset, but did not succeed.
I decided to use issues_dataset_2 as issues_dataset, since I had already wasted too much time on this trivial matter.