Chapter 5 questions

Use this topic for any question about Chapter 5 of the course.

Please correct me if I’m wrong.

A DataSet object can be thought of as tabular data whose rows and columns are examples and features, respectively. The length of a DataSet object is the number of examples, which equals the number of its rows. Each column corresponds to one feature.
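For instance, here is a minimal sketch with the 🤗 Datasets library (the toy data is made up for illustration):

from datasets import Dataset

# Two features (columns), three examples (rows).
ds = Dataset.from_dict({"review": ["good", "bad", "okay"], "rating": [5, 1, 3]})
print(len(ds))          # 3 -- the number of rows/examples
print(ds.column_names)  # ['review', 'rating'] -- the features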

Given this understanding of DataSet, I found the following descriptions in Chapter 5 confusing (in other words, incorrect?).

… here those 1,000 examples gave 1,463 new features, resulting in a shape error.

1,463 is the number of rows (i.e. examples) in the tokenized dataset, which has newly added columns (i.e. features) such as attention_mask, input_ids, etc.

We can check that our new dataset has many more features than the original dataset by comparing the lengths:

len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

Obviously, the above two numbers are the numbers of rows (i.e. the numbers of examples), not the numbers of columns (features). The number of features in this case is 4, namely:

attention_mask
input_ids
overflow_to_sample_mapping
token_type_ids
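For context, the mismatch comes from tokenizing with return_overflowing_tokens=True, which can split one review into several chunks. A minimal sketch along the lines of the chapter's code (assuming tokenizer and drug_dataset are defined as in the chapter):

def tokenize_and_split(examples):
    # With return_overflowing_tokens=True, one long review can yield
    # several 128-token chunks, so the output has more rows than the input.
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

tokenized_dataset = drug_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=drug_dataset["train"].column_names,
)
print(tokenized_dataset["train"].column_names)  # the 4 features listed above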


Hey @ducatyb, your understanding is indeed correct: rows are examples, and columns are features.

We sometimes use “examples” and “features” interchangeably, but I agree this is confusing. I’ll improve the wording to make this clear - thank you for the feedback!


Thanks for clarifying!

In the section Creating your own dataset I got an import error when trying this code

from huggingface_hub import list_datasets

all_datasets = list_datasets()
print(f"Number of datasets on Hub: {len(all_datasets)}")
print(all_datasets[0])

Perhaps the package name should’ve been datasets instead of huggingface_hub?

My bad. I was using an older version of huggingface_hub. The import error went away after I installed the latest version, i.e. version 0.2.1.
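For anyone hitting the same thing, a quick way to check (note: in more recent releases of huggingface_hub, list_datasets() returns an iterator rather than a list, so you may need to materialize it before calling len()):

import huggingface_hub
print(huggingface_hub.__version__)  # the import error was fixed here by upgrading to 0.2.1

from huggingface_hub import list_datasets

# Wrap in list() in case your version returns an iterator instead of a list.
all_datasets = list(list_datasets())
print(f"Number of datasets on Hub: {len(all_datasets)}")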


The course is my first experience with HF, I’m totally wowed. Thanks for the great course and great tools!

For a question in the end-of-chapter quiz for Chapter 5, the grading system shows the correct answer as dataset.shuffle().select(range(len(50))).

I think you don’t mean to have the len in there, and it would instead be dataset.shuffle().select(range(50)).
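For the record, a quick sketch of the corrected snippet (assuming dataset is any datasets.Dataset); range(len(50)) would raise a TypeError, since an int has no len():

# Take 50 random examples; the seed is optional but makes the sample reproducible.
sample = dataset.shuffle(seed=42).select(range(50))
print(len(sample))  # 50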

Hi @dansbecker thank you for the kind words!

You’re totally right about the quiz - that’s a bug which I’ll fix right away :)


I disagree with that analysis. Here we are in a situation where we create several training samples from one example by applying some preprocessing, which is generally called feature extraction. In tabular data, feature extraction often means adding new columns to the dataset (although it sometimes means removing some), which leads people to often call the columns features. Here, in this example, it means adding more rows, so calling the rows the features makes perfect sense to me.

We could however add a remark to explain this :)

In HF Chapter 5, “Time to slice and dice”: The 🤗 Datasets library - Hugging Face Course

I observe

new_drug_dataset = drug_dataset.map(
    lambda x: {"review": html.unescape(x["review"])},
    batched=True,
)

to work slightly better than the version used in the chapter:

new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batched=True,
)


Hey @satpalsr, that’s an interesting result, but I think you’ll find that html.unescape() won’t actually unescape the HTML characters in your first example, because it expects a string, not a list of strings.

You can check the outputs by inspecting the first element of the train split in both cases :)
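For example, a self-contained sketch of the check (the toy strings are made up for illustration):

import html

batch = {"review": ["I&#039;m happy", "5 &amp; 6"]}

# Passing the whole list: unescape first tests `'&' not in s`, which for a
# list is a membership test over its elements, so the list is returned as-is.
print(html.unescape(batch["review"]))
# ['I&#039;m happy', '5 &amp; 6'] -- nothing was unescaped

# Unescaping each string actually decodes the entities.
print([html.unescape(o) for o in batch["review"]])
# ["I'm happy", '5 & 6']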


Hi! I have an issue when saving the dataset:

drug_dataset_clean.save_to_disk("drug-reviews")

A TypeError is raised.
I’ve tried another dataset of type Dataset (freq_dataset from the tutorial, for example), and it works, so maybe the problem is the data type of drug_dataset_clean.
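In case it helps with debugging, a minimal check (assuming drug_dataset_clean is the DatasetDict built earlier in the chapter): both Dataset and DatasetDict expose save_to_disk, so printing the type first can narrow down where the TypeError comes from.

# Sanity check before saving (a sketch, not the chapter's code).
print(type(drug_dataset_clean))  # e.g. datasets.DatasetDict
drug_dataset_clean.save_to_disk("drug-reviews")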

I think there is a mistake in the ‘Creating your own dataset’ section:

issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)

It will raise a TypeError because the data type of x["number"] is numpy.int64 instead of int, and the former cannot be used as an index. So

issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(int(x["number"]))}
)

will work.
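To illustrate the difference, a standalone sketch (numpy only; the issue number is made up):

import numpy as np

n = np.int64(2792)   # 2792 is a hypothetical issue number
print(type(n))       # <class 'numpy.int64'>
print(type(int(n)))  # <class 'int'> -- the defensive cast used above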

Hi @DoyyingFace, thank you for reporting this error! Unfortunately, I am not able to reproduce the error using the code in the chapter’s Colab. Perhaps you missed a cell that needed executing or are using an old version of datasets? If the problem remains, can you share a copy of the notebook that you’re running?


Hi! I ran the notebook attached to the tutorial and it worked, as you said (for both replies I made here). So maybe there is something wrong with my own notebook, and I’ll check my version. Thanks for your help!
By the way, what is the issue with the old version of datasets?