Chapter 5 questions

Use this topic for any question about Chapter 5 of the course.

Please correct me if I'm wrong.

A Dataset object can be thought of as tabular data whose rows and columns are examples and features, respectively. The length of a Dataset object is the number of examples, which equals the number of its rows. Each column corresponds to one feature.
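For example, with a toy Dataset (the data here is made up purely to illustrate):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"], "label": [0, 1, 0]})
print(len(ds))          # 3 -> number of examples (rows)
print(ds.column_names)  # ['text', 'label'] -> the features (columns)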

Given this understanding of Dataset, I found the following descriptions in Chapter 5 confusing (in other words, incorrect?).

… here those 1,000 examples gave 1,463 new features, resulting in a shape error.

1,463 is the number of rows (i.e. examples) in the newly added columns (i.e. features) such as attention_mask, input_ids, etc.

We can check that our new dataset has many more features than the original dataset by comparing the lengths:

len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

Obviously the above two numbers are the numbers of rows, i.e. the numbers of examples, not the numbers of columns (features). The number of features in this case is 4, specifically:

attention_mask
input_ids
overflow_to_sample_mapping
token_type_ids
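A quick way to check both counts (variable names as in the chapter):

print(len(tokenized_dataset["train"]))          # 206772 -> rows (examples)
print(tokenized_dataset["train"].column_names)  # the 4 feature columns listed above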


Hey @ducatyb, your understanding is indeed correct: rows are examples, and columns are features.

We sometimes use "examples" and "features" interchangeably, but I agree this is confusing. I'll improve the wording to make this clear - thank you for the feedback!


Thanks for clarifying!

In the section Creating your own dataset, I got an import error when trying this code:

from huggingface_hub import list_datasets

all_datasets = list_datasets()
print(f"Number of datasets on Hub: {len(all_datasets)}")
print(all_datasets[0])

Perhaps the package name should've been datasets instead of huggingface_hub?

My bad. I was using an older version of huggingface_hub. The import error went away after I installed the latest version, i.e. version 0.2.1.
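In case anyone else hits this, a quick way to check the installed version:

import huggingface_hub

print(huggingface_hub.__version__)  # the import worked for me from 0.2.1 onwards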


The course is my first experience with HF, and I'm totally wowed. Thanks for the great course and great tools!

For a question in the end-of-chapter quiz for Chapter 5, the grading system shows the correct answer as dataset.shuffle().select(range(len(50))).

I think you don't mean to have the len in there, and it would instead be dataset.shuffle().select(range(50)).
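A quick sanity check (using drug_dataset from the chapter):

sample = drug_dataset["train"].shuffle(seed=42).select(range(50))
print(len(sample))  # 50
# range(len(50)) would raise a TypeError, since len() doesn't accept an int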


Hi @dansbecker thank you for the kind words!

You're totally right about the quiz - that's a bug which I'll fix right away :slight_smile:


I disagree with that analysis. Here we are in a situation where we create several training samples from one example by applying some preprocessing, which is generally called feature extraction. In tabular data, feature extraction often means adding new columns to the dataset (although it sometimes means removing some), which leads to people often calling the columns features. Here in this example it means adding more rows, so calling the rows the features makes perfect sense to me.

We could however add a remark to explain this :slight_smile:
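For instance, a minimal sketch of how one example can become several rows during tokenization (checkpoint and lengths picked arbitrarily):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
result = tokenizer(
    ["a very long review " * 200],  # a single example
    truncation=True,
    max_length=128,
    return_overflowing_tokens=True,
)
print(len(result["input_ids"]))  # > 1: the one example overflowed into several rows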

In HF Chapter 5, "Time to slice and dice": Time to slice and dice - Hugging Face Course

I observe

new_drug_dataset = drug_dataset.map(
      lambda x: {"review": html.unescape(x["review"])},
      batched=True)

to work slightly better than the one used in the chapter:

new_drug_dataset = drug_dataset.map(
      lambda x: {"review": [html.unescape(o) for o in x["review"]]},
      batched=True
)


Hey @satpalsr that's an interesting result, but I think you'll find that html.unescape() won't actually unescape the HTML characters in your first example because it expects a string, not a list of strings.

You can check the outputs by inspecting the first element of the train split in both cases :slight_smile:
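For example, with a couple of toy strings (at least with CPython's html module, a list input is returned untouched):

import html

batch = {"review": ["I&#039;m impressed", "It&#039;s great"]}
print(html.unescape(batch["review"]))               # comes back unchanged: nothing is unescaped
print([html.unescape(o) for o in batch["review"]])  # ["I'm impressed", "It's great"]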


Hi! I have an issue when saving the dataset:

drug_dataset_clean.save_to_disk("drug-reviews")

A TypeError is raised.
I've tried another dataset of type Dataset (freq_dataset in the tutorial, for example), and it works. So maybe the problem is the data type of drug_dataset_clean.
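A quick diagnostic sketch, in case it helps:

# Both Dataset and DatasetDict support save_to_disk in recent releases of
# datasets, so it is worth checking which type you actually have
print(type(drug_dataset_clean))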

I think there is a mistake in the 'Creating your own dataset' section:

issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)

It will raise a TypeError because the data type of x["number"] is numpy.int64 instead of int, and the former cannot be used as an index. So

issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(int(x["number"]))}
)

will work.
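For illustration, the cast in the fix just converts the NumPy scalar to a plain Python int (toy values, not the chapter's code):

import numpy as np

n = np.int64(42)
print(type(n))  # <class 'numpy.int64'>
print(int(n))   # 42 -> a plain int, as in int(x["number"]) above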

Hi @DoyyingFace, thank you for reporting this error! Unfortunately, I am not able to reproduce the error using the code in the chapter's Colab. Perhaps you missed a cell that needed executing or are using an old version of datasets? If the problem remains, can you share a copy of the notebook that you're running?

Hi @DoyyingFace thank you for raising this error. Unfortunately I am not able to reproduce it using the Colab notebook provided with the chapter. Perhaps you are using an old version of datasets? If the problem persists, can you please share the notebook you are getting the error in?

Hi! I ran the notebook attached in the tutorial and it worked as you said (for both replies I made here). So maybe there is something wrong with my notebook; I will check my version. Thanks for your help!
BTW, what was your point about the old version of datasets?

Glad to hear it is working! The comment I made about the datasets version is that each release often contains various bug fixes, so upgrading to the latest version is often a quick way to ensure a bug is really coming from the code, not the library :slight_smile:
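For reference, upgrading from a notebook is a one-liner:

!pip install --upgrade datasets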

Hello Everyone!

I'm trying to scrape spaCy's GitHub issues using the steps outlined in the Creating your own dataset section, as recommended in the :pencil2: Try it out! at the end of the tutorial (which, btw, is fantastic). I hit the following error:

from datasets import load_dataset

issues_spacy = load_dataset('json',
                            data_files="spacy-issues.jsonl",
                            split='train')

issues_spacy


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-aeab07c4b7d1> in <module>()
      3 issues_spacy = load_dataset('json',
      4                             data_files="spacy-issues.jsonl",
----> 5                             split='train')
      6 
      7 issues_spacy

14 frames
/usr/local/lib/python3.7/dist-packages/datasets/table.py in array_cast(array, pa_type, allow_number_to_str)
   1017             )
   1018         return array.cast(pa_type)
-> 1019     raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
   1020 
   1021 

TypeError: Couldn't cast array of type
struct<url: string, html_url: string, labels_url: string, id: int64, node_id: string, number: int64, title: string, description: string, creator: struct<login: string, id: int64, node_id: string, avatar_url: string, gravatar_id: string, url: string, html_url: string, followers_url: string, following_url: string, gists_url: string, starred_url: string, subscriptions_url: string, organizations_url: string, repos_url: string, events_url: string, received_events_url: string, type: string, site_admin: bool>, open_issues: int64, closed_issues: int64, state: string, created_at: timestamp[s], updated_at: timestamp[s], due_on: null, closed_at: timestamp[s]>
to
null

I've already searched for what to do but am at a loss at the moment.

Any ideas?

Cheers!

Hey @Evan thanks for reporting this error! It looks like it might be a low-level problem with the way we parse JSON files in datasets. Would you mind uploading your dataset to the Hub and sharing it here so I can try to reproduce it on my side?

Thanks!

Hey @lewtun! A thousand pardons for not responding sooner.

I'm actually having a really hard time reproducing the error myself: when I set num_issues to 2_500 or 5_000, everything runs just fine. However, when I bump it up to 10_000 in the example, I get the error above :confounded:

As for loading the data to the hub, I've uploaded the .json file here: Evan/spaCy-github-issues · Datasets at Hugging Face

Cheers!
