Another issue, this one related to the load_dataset function that loads the file created by the fetch_issues function in the “Creating your own dataset” course section.
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
When the above line is run, the following error is produced:
...
TypeError: Couldn't cast array of type timestamp[s] to null
The above exception was the direct cause of the following exception:
...
DatasetGenerationError: An error occurred while generating the dataset
It seems that the JSON structure GitHub returns has changed significantly since the course was written. It now contains date properties both at the root and in nested objects (in Python terms, nested dict objects).
As other posters have mentioned, the load_dataset function errors when it encounters a null or None value in a property previously inferred to be a date type. I updated iotengtr’s solution, since there seem to be more timestamp fields now.
I created a function that can be called recursively to handle nested dicts:
def dateFixer(inDict):
    """Replaces null or None date properties with a default date
    Args:
        inDict (dict): The dict to clean up
    """
    default_date = "1970-01-01T00:00:00Z"
    # Iterate over the keys of the dict object
    for key in inDict:
        # If the value is itself a dict, make a recursive call
        if isinstance(inDict[key], dict):
            dateFixer(inDict[key])
        else:
            # Timestamps seem to have "_at" in the name such as "created_at", or "_on" as in "due_on"
            if ("_at" in key or "_on" in key) and inDict[key] is None:
                # print(f"Replacing {key}: {inDict[key]} with {default_date}")
                inDict[key] = default_date
Paste this code into the fetch_issues function, just before the line that creates the pandas DataFrame:
# Replace missing timestamp values with a default value
for issue in all_issues:
    dateFixer(issue)
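For a quick sanity check, here is a minimal sketch of the fix applied to a made-up issue dict (the field values are illustrative, not a real GitHub payload):

sample_issue = {
    "created_at": "2023-01-01T00:00:00Z",
    "closed_at": None,                               # root-level timestamp that can be null
    "milestone": {"title": "v1.0", "due_on": None},  # nested timestamp that can be null
}
dateFixer(sample_issue)
print(sample_issue["closed_at"])            # 1970-01-01T00:00:00Z
print(sample_issue["milestone"]["due_on"])  # 1970-01-01T00:00:00Z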
Another issue - I hit the three-consecutive-replies limit.
Unfortunately, I am working on this now and can’t come back later after someone else posts, so I’m stacking things up here in the hope it helps someone in the future.
In the “Creating your own dataset” section, under “Augmenting the dataset”, I get another error when calling the get_comments function:
TypeError: string indices must be integers
The issue seems to be the following line: we index r with a string, assuming it is a dict object. However, some of the objects returned across the thousands of API calls must not be dicts.
return [r["body"] for r in response.json()]
To solve this, I updated the get_comments function with some type checking:
def get_comments(issue_number):
    """Returns a list of comments from the GitHub API for the given issue.
    Args:
        issue_number (int): The id of the issue to insert into the API URL
    Returns:
        list: A list of strings representing the comments for the given issue.
    """
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    # response.json() normally returns a list of dict objects.
    # The original line assumed every r is a dict:
    #     return [r["body"] for r in response.json()]
    # but the TypeError suggests that for some responses r is a string.
    # We are performing thousands of API calls here, so add some type checking.
    response_json = response.json()
    issue_comments = []
    if isinstance(response_json, list):
        for r in response_json:
            # More type checking, and ensure the "body" property is present
            if isinstance(r, dict) and "body" in r:
                issue_comments.append(r["body"])
            elif isinstance(r, str):
                print(f"Found str not dict: {r}")
            else:
                print(f"Found unknown not dict: {r}")
    elif isinstance(response_json, str):
        print(f"Found str not list: {response_json}")
    else:
        print(f"Found unknown not list: {response_json}")
    return issue_comments
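The course then applies this function over the whole dataset with Dataset.map, roughly like the snippet below (issues_dataset is the dataset loaded earlier in this section, and the column name is just illustrative):

# This performs one API call per issue, so it is slow and eats into the rate limit
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)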
Running this function revealed a GitHub rate-limit issue, because there are over 7,000 issues in the dataset now and we perform an individual call to fetch the comments for each one. Once the rate limit is hit, the API returns a dict containing the 403 message rather than a list of data, which also triggers a TypeError.
One factor that may be creating the “it suddenly works after I come back to it later” effect is that if you run through this section in less than an hour, you use up most of the rate limit in the earlier steps and then hit the limit in the later ones.
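If you want to check where you stand, you can query GitHub’s rate-limit endpoint; here is a minimal sketch, assuming headers is the same authorization dict used for the other requests in this section:

import requests

rate = requests.get("https://api.github.com/rate_limit", headers=headers)
core = rate.json()["resources"]["core"]
# Shows your total limit, how many calls remain, and the Unix time when the window resets
print(core["limit"], core["remaining"], core["reset"])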
So, I updated the fetch_issues function call to override the num_issues parameter from its default of 10,000 down to 2,000. This should fetch a useful amount of data while leaving headroom to run the later steps, which also hit the GitHub API:
# Depending on your internet connection, this can take several minutes to run...
all_issues = fetch_issues(num_issues=2_000)