I got exactly that, right now, on 15/03/2024. A pity it hasn't been fixed.
You may want to try this:
from datasets import load_dataset
pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline", split="train")
pubmed_dataset
That’s great, thanks for sharing this!
In the section "Semantic search with FAISS", our checkpoint model already gives a pooler_output as one of its outputs, along with last_hidden_state. But we still created our own function to get the pooled embeddings. Is it different from the above?
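For what it's worth, they are usually not the same thing: for BERT-style models, pooler_output is the [CLS] hidden state passed through an extra dense + tanh "pooler" layer, while the course's cls_pooling function takes the raw [CLS] embedding from last_hidden_state. A quick hedged check (the checkpoint name is the one I recall from the course):

import torch
from transformers import AutoModel, AutoTokenizer

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"  # course checkpoint, as I recall
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

inputs = tokenizer("How can I load a dataset offline?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0]  # what cls_pooling returns
pooler_embedding = outputs.pooler_output         # [CLS] passed through the pooler head

print(torch.allclose(cls_embedding, pooler_embedding))  # typically False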
I've re-uploaded on HF the dataset PUBMED_title_abstracts_2019_baseline used in the course section "Big data? 🤗 Datasets to the rescue!" (Hugging Face NLP Course), in case it helps students get the exact same setup.
Alternatively, you can use the following code (only the URL changes from the course):
from datasets import load_dataset, DownloadConfig
data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset(
    "json",
    data_files=data_files,
    split="train",
    download_config=DownloadConfig(delete_extracted=True),  # optional: delete the extracted file to save disk space
)
A brilliant fix! Thanks for this!
I got an import error after executing the code below in the "Semantic search with FAISS" part:
embeddings_dataset.add_faiss_index(column="embeddings")
It always shows the error below, even though I've installed both faiss-cpu and faiss-gpu:
"ImportError: You must install Faiss to use FaissIndex. To do so you can run conda install -c pytorch faiss-cpu or conda install -c pytorch faiss-gpu. A community supported package is also available on pypi: pip install faiss-cpu or pip install faiss-gpu. Note that pip may not have the latest version of FAISS, and thus, some of the latest features and bug fixes may not be available."
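In case it helps others hitting this: one common cause is that faiss-cpu was installed into a different environment than the one the notebook kernel is using, or onto a Python version that has no prebuilt faiss wheel. A quick hedged sanity check:

import sys

# Confirm which interpreter the kernel is running and its version
print(sys.executable)
print(sys.version)

# Try the import directly to see the real error, if any
try:
    import faiss
    print("faiss version:", faiss.__version__)
except ImportError as err:
    print("faiss is not importable from this kernel:", err)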
Problem solved. Thanks for sharing!
Hi, in Chapter 5 “Semantic search with FAISS”, when doing the last step of “loading and preparing the dataset”:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }

comments_dataset = comments_dataset.map(concatenate_text)
I got an error:
TypeError: can only concatenate list (not "str") to list
What is the reason for it? How can I fix it? Thanks!
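One likely cause, judging from the error: the comments column still holds a list of strings rather than one comment per row (for example, if the df.explode("comments") step from the course was skipped), so string + list fails. A hedged sketch of a workaround that joins the list first:

# Sketch of a workaround, assuming examples["comments"] is a list of strings
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + " \n ".join(examples["comments"])
    }

comments_dataset = comments_dataset.map(concatenate_text)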
Question: RAG uses FAISS under the hood, is that so?
The embedding-vector similarity matching reminds me of what RAG is about; just checking whether it's fundamentally the same construct.
Hey @lewtun
I was trying to do this exercise as well.
There are several conditions that are not in the train split but do appear in the validation or test splits, as seen above. Should we simply drop those samples from the validation and test sets, then?
Thanks!
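If it helps, here is a hedged sketch of dropping those rows with Dataset.filter; the names drug_dataset and condition are placeholders for whatever your DatasetDict and column are called:

# Keep only validation/test rows whose condition was seen in the train split
# (drug_dataset and "condition" are placeholder names)
train_conditions = set(drug_dataset["train"]["condition"])

for split in ["validation", "test"]:
    drug_dataset[split] = drug_dataset[split].filter(
        lambda x: x["condition"] in train_conditions
    )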
I fixed the error by casting the 'milestone' column to string first.
import pandas as pd
from datasets import load_dataset
# Load the JSONL file into a pandas DataFrame
df = pd.read_json("datasets-issues.jsonl", lines=True)
df['milestone'] = df['milestone'].astype("str")
# Save the modified DataFrame back to a JSONL file
df.to_json("datasets-issues-fixed.jsonl", orient="records", lines=True)
# Now try loading the dataset with the fixed file
issues_dataset = load_dataset("json", data_files="datasets-issues-fixed.jsonl")
issues_dataset
How can I check for spelling and grammatical mistakes in the text data? Thank you.
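One lightweight option for spelling (grammar checking would need a heavier tool such as LanguageTool) is the third-party pyspellchecker package. A hedged sketch, where the "text" column name is just an assumption:

# pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

def flag_misspellings(example):
    # Very naive tokenization; punctuation will cause false positives
    words = example["text"].split()
    example["misspelled"] = sorted(spell.unknown(words))
    return example

# e.g. comments_dataset = comments_dataset.map(flag_misspellings)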
Hi, I solved this problem by changing my Python version to 3.9.
Thanks for the hint.
Verified working 2025-01-13
Standard Kaggle Notebook
Possible issue in the fetch_issues function in the "Creating your own dataset" course section.
According to the course documentation, if we provide a GitHub PAT in the request header, we can make 5,000 API calls to GitHub per hour.
With num_issues set to 10,000 and per_page set to 100, we will be making only 100 API calls in total. However, the following line triggers a one-hour "sleep" if the size of the batch collection exceeds the rate limit. The batch contains all the records accumulated so far across requests, so after 50 API calls it will hold 5,000 records and trigger the sleep.
if len(batch) > rate_limit and len(all_issues) < num_issues:
With 5,000 or fewer issues this would never cause a problem, but we are now at over 7,000 issues on this repo.
I believe this logic should be testing how many requests have been made so far rather than how many records have been downloaded. I changed my code to the following to fix the issue:
if len(batch)/per_page > rate_limit and len(all_issues) < num_issues:
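For reference, an alternative is to count the requests directly rather than dividing the record count. Below is a rough, hedged sketch of that idea, loosely based on the course's fetch_issues (the DataFrame/JSONL export is omitted, headers with your GitHub token is assumed to be defined as in the course, and requests_made is my own variable):

import math
import time

import requests
from tqdm import tqdm

def fetch_issues_sketch(
    owner="huggingface",
    repo="datasets",
    num_issues=10_000,
    per_page=100,
    rate_limit=5_000,
):
    # Rough sketch only: count the API requests actually made instead of the
    # number of records accumulated so far in the batch
    all_issues = []
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"
    requests_made = 0

    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        response = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        requests_made += 1
        all_issues.extend(response.json())

        if requests_made >= rate_limit and len(all_issues) < num_issues:
            print("Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)
            requests_made = 0

    return all_issues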
Another issue, related to the load_dataset call that loads the file created by the fetch_issues function in the "Creating your own dataset" course section.
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
When the above line is run, an error is produced:
...
TypeError: Couldn't cast array of type timestamp[s] to null
The above exception was the direct cause of the following exception:
...
DatasetGenerationError: An error occurred while generating the dataset
It seems that maybe the JSON response structure GitHub returns has changed significantly since the course was written. It now seems to contain date properties both at the root and in nested objects (in Python terms, nested dict objects).
As other posters have mentioned, the load_dataset function errors when it encounters a null or None value in a property previously inferred to be a date type. I updated iotengtr's solution, since there seem to be more timestamps now.
I created a function that can be called recursively to handle nested dicts:
def dateFixer(inDict):
    """Replaces null or None date properties with a default date

    Args:
        inDict (dict): The dict to clean up
    """
    default_date = "1970-01-01T00:00:00Z"
    # Iterate over the keys of the dict object
    for key in inDict:
        # If that value type is a dict then make a recursive call
        if isinstance(inDict[key], dict):
            dateFixer(inDict[key])
        else:
            # Timestamps seem to have "_at" in the name such as "created_at" or "_on" -> "due_on"
            if ("_at" in key or "_on" in key) and inDict[key] is None:
                # print(f"{issue_count} - Replacing {key}: {issue[key]} with {default_date}")
                inDict[key] = default_date
Paste this code into the fetch_issues function, before the line that creates the Pandas DataFrame:
# Replace missing timestamp values with a default value
for issue in all_issues:
    dateFixer(issue)
Another issue (and I've hit the 3-consecutive-replies limit).
Unfortunately, I am working on this now and can't come back later after someone else posts, so I'm stacking things up here hoping it helps someone in the future.
In the section "Creating your own dataset", under "Augmenting the dataset", I'm getting another error when calling the get_comments function:
TypeError: string indices must be integers
The issue seems to be the following line. We are trying to index r with a string, assuming it is a dict object. However, some of the objects returned by the thousands of API calls must not be dicts.
return [r["body"] for r in response.json()]
To solve it, I updated the get_comments function with some type checking:
def get_comments(issue_number):
    """Returns a list of comments from the GitHub API for the given issue.

    Args:
        issue_number (int): The id of the issue to insert into the API URL

    Returns:
        list: A list of strings representing the comments for the given issue.
    """
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    # response.json() normally returns a list of dict objects.
    # The original line assumed r is a dict object:
    #     return [r["body"] for r in response.json()]
    # but the TypeError suggests that for some responses r is a string.
    # We are performing thousands of API calls here, so add some type checking.
    response_json = response.json()
    issue_comments = []
    if isinstance(response_json, list):
        for r in response_json:
            # More type checking, and make sure the "body" property is present
            if isinstance(r, dict) and "body" in r:
                issue_comments.append(r["body"])
            elif isinstance(r, str):
                print(f"Found str not dict: {r}")
            else:
                print(f"Found unknown not dict: {r}")
    elif isinstance(response_json, str):
        print(f"Found str not list: {response_json}")
    else:
        print(f"Found unknown not list: {response_json}")
    return issue_comments
Running this function revealed a GitHub rate-limit issue, because there are now over 7,000 issues in the dataset and we perform an individual API call to get the comments for each one. After the rate limit is hit, the API returns a dict object containing the 403 message rather than a list object with data, which also causes a TypeError.
One factor that may be creating the "it suddenly works after I come back to it later" effect: if you run through this section in less than one hour, you use up the rate limit in the earlier steps, and then the later steps hit the limit.
So I updated the fetch_issues function call to override the num_issues parameter from its default of 10,000 down to 2,000. This should fetch some of the data while leaving headroom to run the later steps, which also hit the GitHub API (see the rate-limit check sketch after the code below).
# Depending on your internet connection, this can take several minutes to run...
all_issues = fetch_issues(num_issues=2_000)
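Related to the headroom point above: GitHub exposes a rate_limit endpoint that does not count against your quota, so you can check what's left before kicking off the per-issue comment calls. A minimal sketch, again assuming headers with your token is already defined as earlier in the section:

import requests

# Check the remaining hourly quota before launching thousands of comment calls;
# querying this endpoint does not count against the rate limit
response = requests.get("https://api.github.com/rate_limit", headers=headers)
core = response.json()["resources"]["core"]
print(f"Remaining requests this hour: {core['remaining']} / {core['limit']}")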