Use this topic for any question about Chapter 5 of the course.
Please correct me if I'm wrong.
A Dataset object can be thought of as tabular data, whose rows are examples and whose columns are features. The length of a Dataset object is the number of examples, which equals the number of its rows. Each column corresponds to one feature.
Given this understanding of Dataset, I found the following descriptions in Chapter 5 confusing (in other words, incorrect?).
… here those 1,000 examples gave 1,463 new features, resulting in a shape error.
1,463 is the number of rows (i.e. examples) of the newly added columns (i.e. features) such as attention_mask, input_ids, etc.
We can check that our new dataset has many more features than the original dataset by comparing the lengths:
len(tokenized_dataset["train"]), len(drug_dataset["train"])
(206772, 138514)
Obviously the above two numbers are the numbers of rows, i.e. the numbers of examples, not the numbers of columns (features). The number of features in this case is 4. Specifically, these 4 features are attention_mask, input_ids, overflow_to_sample_mapping, and token_type_ids.
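A quick way to see both quantities at once, assuming tokenized_dataset is defined as in the chapter, is to compare len() (rows, i.e. examples) with column_names (columns, i.e. features):

print(len(tokenized_dataset["train"]))                # 206772 rows, i.e. examples
print(tokenized_dataset["train"].column_names)        # the columns, i.e. features
print(len(tokenized_dataset["train"].column_names))   # 4 features in this case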
Hey @ducatyb, your understanding is indeed correct: rows are examples, and columns are features.
We sometimes use "examples" and "features" interchangeably, but I agree this is confusing. I'll improve the wording to make this clear - thank you for the feedback!
Thanks for clarifying!
In the section Creating your own dataset I got an import error when trying this code
from huggingface_hub import list_datasets
all_datasets = list_datasets()
print(f"Number of datasets on Hub: {len(all_datasets)}")
print(all_datasets[0])
Perhaps the package name should've been datasets instead of huggingface_hub?
My bad. I was using an older version of huggingface_hub. The import error went away after I installed the latest version, i.e. version 0.2.1.
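In case it helps anyone else hitting the same ImportError, a quick way to check which version is installed before upgrading (a minimal sketch):

import huggingface_hub
print(huggingface_hub.__version__)  # the top-level list_datasets import worked for me on 0.2.1
# upgrade with: pip install --upgrade huggingface_hub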
The course is my first experience with HF, and I'm totally wowed. Thanks for the great course and great tools!
For a question in the end-of-chapter quiz for Chapter 5, the grading system shows the correct answer as dataset.shuffle().select(range(len(50))).
I think you don't mean to have the len in there, and it would instead be dataset.shuffle().select(range(50)).
Hi @dansbecker thank you for the kind words!
You're totally right about the quiz - that's a bug which I'll fix right away!
I disagree with that analysis. Here we are in a situation where we create several training samples from one example by applying some preprocessing, which is generally called feature extraction. In tabular data, feature extraction often means adding new columns to the dataset (although it sometimes means removing some), which leads to people often calling the columns features. Here in this example, it means adding more rows, so calling the rows the features makes perfect sense to me.
We could, however, add a remark to explain this.
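For example, here is a simplified sketch of what happens in that section (assuming drug_dataset from the chapter): a batched map() that splits each review into chunks returns more rows than it receives, which is exactly why the old columns have to be removed and why the two row counts differ:

def split_reviews(batch, chunk_size=128):
    # one long review can produce several chunks, so the output batch is longer
    chunks = []
    for review in batch["review"]:
        chunks.extend(review[i : i + chunk_size] for i in range(0, len(review), chunk_size))
    return {"review_chunk": chunks}

chunked_dataset = drug_dataset["train"].map(
    split_reviews,
    batched=True,
    remove_columns=drug_dataset["train"].column_names,
)
print(len(chunked_dataset), len(drug_dataset["train"]))  # more rows after chunking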
In HF Chapter 5, "Time to slice and dice": Time to slice and dice - Hugging Face Course
I observe
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": html.unescape(x["review"])},
    batched=True,
)
to work slightly better than the version used in the course:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batched=True,
)
Results:
Hey @satpalsr that's an interesting result, but I think you'll find that html.unescape() won't actually unescape the HTML characters in your first example because it expects a string, not a list of strings. You can check the outputs by inspecting the first element of the train split in both cases.
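For example, with hypothetical names for the two versions (and assuming import html and drug_dataset from the chapter), something like this shows the difference:

unescaped_wrong = drug_dataset.map(
    lambda x: {"review": html.unescape(x["review"])}, batched=True
)
unescaped_right = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)
# in batched mode x["review"] is a list, so the first call typically returns it
# unchanged and entities like &#039; are still present; the second version
# actually unescapes each review string
print(unescaped_wrong["train"][0]["review"])
print(unescaped_right["train"][0]["review"])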
Hi! I have an issue when saving the dataset:
drug_dataset_clean.save_to_disk("drug-reviews")
A TypeError is raised.
I've tried another dataset of type Dataset (freq_dataset from the tutorial, for example), and it works, so maybe the problem is the data type of drug_dataset_clean.
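For reference, this is the kind of minimal check I mean, a round trip with a made-up toy dataset to isolate the problem from the method itself:

from datasets import Dataset, load_from_disk

toy = Dataset.from_dict({"review": ["great", "terrible"], "rating": [9, 2]})
toy.save_to_disk("toy-reviews")       # saving the toy Dataset works
print(load_from_disk("toy-reviews"))  # and the round trip succeeds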
I think there is a mistake in the "Creating your own dataset" section:
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)
It will raise a TypeError because the data type of x["number"] is numpy.int64 instead of int, and the former cannot be used as an index. So
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(int(x["number"]))}
)
will work.
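If anyone wants to check what type map() is actually passing in their own setup, here is a minimal diagnostic sketch (assuming issues_dataset from the section):

def show_number_type(example):
    # prints e.g. <class 'int'> or <class 'numpy.int64'> depending on the setup
    print(type(example["number"]))
    return {}

issues_dataset.select(range(1)).map(show_number_type)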
Hi @DoyyingFace, thank you for reporting this error! Unfortunately, I am not able to reproduce it using the code in the chapter's Colab. Perhaps you missed a cell that needed executing, or you are using an old version of datasets? If the problem remains, can you share a copy of the notebook that you're running?
Hi @DoyyingFace, thank you for raising this error. Unfortunately, I am not able to reproduce it using the Colab notebook provided with the chapter. Perhaps you are using an old version of datasets? If the problem persists, can you please share the notebook you are getting the error in?
Hi! I ran the notebook attached to the tutorial and it worked as you said (for both of the issues I reported here). So maybe there is something wrong with my own notebook, and I'll check my version. Thanks for your help!
BTW, what is it about the old version of datasets?
Glad to hear it is working! The point I made about the datasets version is that each release often contains various bug fixes, so upgrading to the latest version is a quick way to ensure a bug is really coming from your code, not the library.
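For example, you can check which version you're on and upgrade like this:

import datasets
print(datasets.__version__)
# and in a terminal or notebook cell:
# pip install --upgrade datasets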
Hello Everyone!
I'm trying to scrape spaCy's GitHub issues using the steps outlined in the Creating your own dataset section, as recommended in the Try it out! at the end of the tutorial (which, btw, are fantastic), when I hit the following error:
from datasets import load_dataset

issues_spacy = load_dataset('json',
                            data_files="spacy-issues.jsonl",
                            split='train')
issues_spacy
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-aeab07c4b7d1> in <module>()
3 issues_spacy = load_dataset('json',
4 data_files="spacy-issues.jsonl",
----> 5 split='train')
6
7 issues_spacy
14 frames
/usr/local/lib/python3.7/dist-packages/datasets/table.py in array_cast(array, pa_type, allow_number_to_str)
1017 )
1018 return array.cast(pa_type)
-> 1019 raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
1020
1021
TypeError: Couldn't cast array of type
struct<url: string, html_url: string, labels_url: string, id: int64, node_id: string, number: int64, title: string, description: string, creator: struct<login: string, id: int64, node_id: string, avatar_url: string, gravatar_id: string, url: string, html_url: string, followers_url: string, following_url: string, gists_url: string, starred_url: string, subscriptions_url: string, organizations_url: string, repos_url: string, events_url: string, received_events_url: string, type: string, site_admin: bool>, open_issues: int64, closed_issues: int64, state: string, created_at: timestamp[s], updated_at: timestamp[s], due_on: null, closed_at: timestamp[s]>
to
null
I've already searched for what to do but am at a loss at the moment.
Any ideas?
Cheers!
Hey @Evan thanks for reporting this error! It looks like it might be a low-level problem with the way we parse JSON files in datasets. Would you mind uploading your dataset to the Hub and sharing it here so I can try to reproduce it on my side?
Thanks!
Hey @lewtun! A thousand pardons for not responding sooner.
I'm actually having a really hard time reproducing the error myself: when I set num_issues to 2_500 or 5_000, everything runs just fine. However, when I bump it up to 10_000 in the example, I get the error above.
As for loading the data to the Hub, I've uploaded the .json file here: Evan/spaCy-github-issues · Datasets at Hugging Face
Cheers!
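In case it's useful while this gets debugged, here is one possible workaround sketch (an untested guess): the struct in the traceback looks like GitHub's milestone object, which is null for most issues, so stripping that field from the JSONL before loading may avoid the bad type inference. Other nested fields could in principle hit the same cast problem.

import json
from datasets import load_dataset

# drop the (mostly null) nested "milestone" field before loading
with open("spacy-issues.jsonl") as f_in, open("spacy-issues-clean.jsonl", "w") as f_out:
    for line in f_in:
        issue = json.loads(line)
        issue.pop("milestone", None)
        f_out.write(json.dumps(issue) + "\n")

issues_spacy = load_dataset("json", data_files="spacy-issues-clean.jsonl", split="train")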