Use this topic for any question about Chapter 5 of the course.
Please correct me if I'm wrong.
A Dataset object can be thought of as tabular data, whose rows are examples and whose columns are features. The length of a Dataset object is the number of examples, which equals the number of its rows. Each column corresponds to one feature.
Given this understanding of Dataset, I found the following descriptions in Chapter 5 confusing (in other words, incorrect?).
… here those 1,000 examples gave 1,463 new features, resulting in a shape error.
1,463 is the number of rows (i.e. examples) of the newly added columns (i.e. features) such as attention_mask, input_ids, etc.
We can check that our new dataset has many more features than the original dataset by comparing the lengths:
len(tokenized_dataset["train"]), len(drug_dataset["train"])
(206772, 138514)
Obviously the above two numbers are the numbers of rows, i.e. the numbers of examples, not the numbers of columns (features). The number of features in this case is 4. Specifically, these 4 features are attention_mask, input_ids, overflow_to_sample_mapping, and token_type_ids.
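A quick way to see both quantities at once, assuming tokenized_dataset is defined as in the chapter, is to compare len() (rows, i.e. examples) with column_names (columns, i.e. features):

print(len(tokenized_dataset["train"]))                # 206772 rows, i.e. examples
print(tokenized_dataset["train"].column_names)        # the columns, i.e. features
print(len(tokenized_dataset["train"].column_names))   # 4 features in this case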
Hey @ducatyb, your understanding is indeed correct: rows are examples, and columns are features.
We sometimes use "examples" and "features" interchangeably, but I agree this is confusing. I'll improve the wording to make this clear - thank you for the feedback!
Thanks for clarifying!
In the section Creating your own dataset I got an import error when trying this code
from huggingface_hub import list_datasets
all_datasets = list_datasets()
print(f"Number of datasets on Hub: {len(all_datasets)}")
print(all_datasets[0])
Perhaps the package name should've been datasets instead of huggingface_hub?
My bad. I was using an older version of huggingface_hub. The import error went away after I installed the latest version, i.e. version 0.2.1.
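In case it helps anyone else hitting the same ImportError, a quick way to check which version is installed before upgrading (a minimal sketch):

import huggingface_hub
print(huggingface_hub.__version__)  # the top-level list_datasets import worked for me on 0.2.1
# upgrade with: pip install --upgrade huggingface_hub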
The course is my first experience with HF, and I'm totally wowed. Thanks for the great course and great tools!
For a question in the end-of-chapter quiz for Chapter 5, the grading system shows the correct answer as dataset.shuffle().select(range(len(50))).
I think you don't mean to have the len in there, and it would instead be dataset.shuffle().select(range(50)).
Hi @dansbecker thank you for the kind words!
You're totally right about the quiz - that's a bug which I'll fix right away!
I disagree with that analysis. Here we are in a situation where we create several training samples from one example by applying some preprocessing, which is generally called feature extraction. In tabular data, feature extraction often means adding new columns to the dataset (although it sometimes means removing some), which leads to people often calling the columns features. Here in this example, it means adding more rows, so calling the rows the features makes perfect sense to me.
We could, however, add a remark to explain this.
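For example, here is a simplified sketch of what happens in that section (assuming drug_dataset from the chapter): a batched map() that splits each review into chunks returns more rows than it receives, which is exactly why the old columns have to be removed and why the two row counts differ:

def split_reviews(batch, chunk_size=128):
    # one long review can produce several chunks, so the output batch is longer
    chunks = []
    for review in batch["review"]:
        chunks.extend(review[i : i + chunk_size] for i in range(0, len(review), chunk_size))
    return {"review_chunk": chunks}

chunked_dataset = drug_dataset["train"].map(
    split_reviews,
    batched=True,
    remove_columns=drug_dataset["train"].column_names,
)
print(len(chunked_dataset), len(drug_dataset["train"]))  # more rows after chunking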
In HF Chapter 5, "Time to slice and dice": Time to slice and dice - Hugging Face Course
I observe
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": html.unescape(x["review"])},
    batched=True,
)
to work slightly better than the version used in the course:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batched=True,
)
Results:
Hey @satpalsr that's an interesting result, but I think you'll find that html.unescape() won't actually unescape the HTML characters in your first example because it expects a string, not a list of strings. You can check the outputs by inspecting the first element of the train split in both cases.
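For example, with hypothetical names for the two versions (and assuming import html and drug_dataset from the chapter), something like this shows the difference:

unescaped_wrong = drug_dataset.map(
    lambda x: {"review": html.unescape(x["review"])}, batched=True
)
unescaped_right = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)
# in batched mode x["review"] is a list, so the first call typically returns it
# unchanged and entities like &#039; are still present; the second version
# actually unescapes each review string
print(unescaped_wrong["train"][0]["review"])
print(unescaped_right["train"][0]["review"])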
Hi! I have an issue when saving the dataset:
drug_dataset_clean.save_to_disk("drug-reviews")
A TypeError is raised.
I've tried another dataset of type Dataset (freq_dataset from the tutorial, for example), and it works, so maybe the problem is the data type of drug_dataset_clean.
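For reference, this is the kind of minimal check I mean, a round trip with a made-up toy dataset to isolate the problem from the method itself:

from datasets import Dataset, load_from_disk

toy = Dataset.from_dict({"review": ["great", "terrible"], "rating": [9, 2]})
toy.save_to_disk("toy-reviews")       # saving the toy Dataset works
print(load_from_disk("toy-reviews"))  # and the round trip succeeds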
I think there is a mistake in the "Creating your own dataset" section:
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)
It will raise a TypeError because the data type of x["number"] is numpy.int64 instead of int, and the former cannot be used as an index. So
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(int(x["number"]))}
)
will work.
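If anyone wants to check what type map() is actually passing in their own setup, here is a minimal diagnostic sketch (assuming issues_dataset from the section):

def show_number_type(example):
    # prints e.g. <class 'int'> or <class 'numpy.int64'> depending on the setup
    print(type(example["number"]))
    return {}

issues_dataset.select(range(1)).map(show_number_type)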
Hi @DoyyingFace, thank you for reporting this error! Unfortunately, I am not able to reproduce it using the code in the chapter's Colab. Perhaps you missed a cell that needed executing, or you are using an old version of datasets? If the problem remains, can you share a copy of the notebook that you're running?
Hi @DoyyingFace, thank you for raising this error. Unfortunately, I am not able to reproduce it using the Colab notebook provided with the chapter. Perhaps you are using an old version of datasets? If the problem persists, can you please share the notebook you are getting the error in?
Hi! I ran the notebook attached to the tutorial and it worked as you said (for both of the issues I reported here). So maybe there is something wrong with my own notebook, and I'll check my version. Thanks for your help!
BTW, what is it about the old version of datasets?
Glad to hear it is working! The point I made about the datasets version is that each release often contains various bug fixes, so upgrading to the latest version is a quick way to ensure a bug is really coming from your code, not the library.
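For example, you can check which version you're on and upgrade like this:

import datasets
print(datasets.__version__)
# and in a terminal or notebook cell:
# pip install --upgrade datasets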
Hello Everyone!
I'm trying to scrape spaCy's GitHub issues using the steps outlined in the Creating your own dataset section, as recommended in the Try it out! at the end of the tutorial (which, btw, are fantastic), when I hit the following error:
from datasets import load_dataset

issues_spacy = load_dataset('json',
                            data_files="spacy-issues.jsonl",
                            split='train')
issues_spacy
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-aeab07c4b7d1> in <module>()
3 issues_spacy = load_dataset('json',
4 data_files="spacy-issues.jsonl",
----> 5 split='train')
6
7 issues_spacy
14 frames
/usr/local/lib/python3.7/dist-packages/datasets/table.py in array_cast(array, pa_type, allow_number_to_str)
1017 )
1018 return array.cast(pa_type)
-> 1019 raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
1020
1021
TypeError: Couldn't cast array of type
struct<url: string, html_url: string, labels_url: string, id: int64, node_id: string, number: int64, title: string, description: string, creator: struct<login: string, id: int64, node_id: string, avatar_url: string, gravatar_id: string, url: string, html_url: string, followers_url: string, following_url: string, gists_url: string, starred_url: string, subscriptions_url: string, organizations_url: string, repos_url: string, events_url: string, received_events_url: string, type: string, site_admin: bool>, open_issues: int64, closed_issues: int64, state: string, created_at: timestamp[s], updated_at: timestamp[s], due_on: null, closed_at: timestamp[s]>
to
null
I've already searched for what to do but am at a loss at the moment.
Any ideas?
Cheers!
Hey @Evan thanks for reporting this error! It looks like it might be a low-level problem with the way we parse JSON files in datasets. Would you mind uploading your dataset to the Hub and sharing it here so I can try to reproduce it on my side?
Thanks!
Hey @lewtun! A thousand pardons for not responding sooner.
I'm actually having a really hard time reproducing the error myself: when I set num_issues to 2_500 or 5_000, everything runs just fine. However, when I bump it up to 10_000 in the example, I get the error above.
As for loading the data to the Hub, I've uploaded the .json file here: Evan/spaCy-github-issues · Datasets at Hugging Face
Cheers!
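In case it's useful while this gets debugged, here is one possible workaround sketch (an untested guess): the struct in the traceback looks like GitHub's milestone object, which is null for most issues, so stripping that field from the JSONL before loading may avoid the bad type inference. Other nested fields could in principle hit the same cast problem.

import json
from datasets import load_dataset

# drop the (mostly null) nested "milestone" field before loading
with open("spacy-issues.jsonl") as f_in, open("spacy-issues-clean.jsonl", "w") as f_out:
    for line in f_in:
        issue = json.loads(line)
        issue.pop("milestone", None)
        f_out.write(json.dumps(issue) + "\n")

issues_spacy = load_dataset("json", data_files="spacy-issues-clean.jsonl", split="train")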