Chapter 5 questions

Thanks for sharing the dataset @Evan! I was able to reproduce your error, so I’ve opened an issue on the datasets repo here: TypeError: Couldn't cast array of type for JSONLines dataset · Issue #3965 · huggingface/datasets · GitHub

Hello everyone,

I am very new to the topic, so sorry if this question is obvious.

I’d like to start working on this task (Chapter 5 - Time to slice and dice):

  1. Use the techniques from Chapter 3 to train a classifier that can predict the patient condition based on the drug review.

Since this label (patient condition) is also a string (I think there are 819 unique conditions), what would be the best approach? I was thinking about tokenizing this field and then using a seq2seq model, or maybe assigning a number to each unique condition.

Thanks for the great course!

Hey @juancopi81, what I had in mind was the second approach you describe: treat each condition as a label and train a multiclass classifier. Given that there are so many labels, you might want to explore top-k accuracy as a metric, but the main goal of the exercise is to give you some practice training models in a new setting :slight_smile:
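
For concreteness, here is a minimal sketch of that approach, assuming the data file and column names from the chapter (review text in "review", target in "condition"); it only covers the data preparation, the Trainer setup is then the same as in Chapter 3:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: the TSV file and column names match the ones used in the chapter
drug_dataset = load_dataset(
    "csv", data_files="drugsComTrain_raw.tsv", delimiter="\t", split="train"
)
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

# Turn the string column into a ClassLabel so each condition gets an integer id
drug_dataset = drug_dataset.class_encode_column("condition")
num_labels = drug_dataset.features["condition"].num_classes

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
)

def tokenize(batch):
    return tokenizer(batch["review"], truncation=True)

encoded = drug_dataset.map(tokenize, batched=True).rename_column("condition", "labels")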

Hi there,

the last “Try it out!” task here asks us

  1. to “create an own dataset of GitHub issues” and
  2. to “fine-tune a multilabel classifier” (for bonus points :wink:).

I have created this dataset. It has 57 different labels and an instance may be labelled with any combination of those. I would like to add the class label names ["bug", "benchmark", "performance", ...] to the dataset. Inspired by this forum post, I have tried the following, yet without success:

from datasets import ClassLabel

# Copy the existing features and declare "arr_labels" as a ClassLabel
features = transformers_issues_labels.features.copy()
features["arr_labels"] = ClassLabel(names=unique_labels)

# Re-map the dataset with the new features to apply the cast
transformers_issues_labels = transformers_issues_labels.map(
    lambda batch: batch, batched=False, features=features
)

TypeError: Couldn't cast array of type list<item: int64> to int64

=> Two questions:

  1. How do I build a classifier for this task (e.g. a "MultiLabelFromPretrainedClassifier" or something like this…)?
  2. How can I add the class label names to my dataset (specifically to the "arr_labels" feature, assuming this makes sense)?

P.s. In any case: Thanks a ton to all contributors of this course. I am learning a lot and looking forward to part 3.

Hey @mdroth, as a hint, you can check out the problem_type parameter of TrainingArguments - this allows you to configure the loss for multilabel problems :slight_smile:

You might also want to check out the code associated with Chapter 9 of our book, which covers a similar topic: notebooks/09_few-to-no-labels.ipynb at main · nlp-with-transformers/notebooks · GitHub

HTH!
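
One note that may save some searching: in the transformers versions I have used, problem_type is a model/config parameter rather than a TrainingArguments field. A minimal sketch of the multi-label setup (57 labels, matching the dataset described above):

from transformers import AutoModelForSequenceClassification

# With problem_type="multi_label_classification" the model uses BCEWithLogitsLoss,
# so the labels should be float vectors of shape (num_labels,) with 0.0/1.0 entries
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=57,
    problem_type="multi_label_classification",
)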

Hi @lewtun, unfortunately, I couldn’t find a problem_type parameter in the documentation of TrainingArguments (I am using transformers.__version__ = '4.17.0'). I do not want to bomb this topic with my very specific issue, so I created a new topic here.

I am also still curious about adding the class label names to the dataset (my 2nd item).

Any help is much appreciated.
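
Regarding the class label names, one approach that might work (a sketch, assuming arr_labels holds lists of integer label ids and unique_labels is your list of 57 label names): wrap the ClassLabel in a Sequence feature, since each example carries a list of labels rather than a single one.

from datasets import ClassLabel, Sequence

# A bare ClassLabel expects one integer per example, which is why the earlier map()
# call failed with "Couldn't cast array of type list<item: int64> to int64".
# Sequence(ClassLabel(...)) declares a variable-length list of class ids instead.
transformers_issues_labels = transformers_issues_labels.cast_column(
    "arr_labels", Sequence(ClassLabel(names=unique_labels))
)

# The label names are now attached to the feature
print(transformers_issues_labels.features["arr_labels"].feature.names[:5])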

Solved.

When doing:

from datasets import load_dataset

data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

and

next(iter(pubmed_dataset_streamed))

I get the error:
StopIteration:

When doing:

list(pubmed_dataset_streamed)

I get an empty list. Can you help me?

Apologies if this has been addressed elsewhere, but when I try to load the dataset, I get the error below:

from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

ConnectionError: HTTPSConnectionPool(host='mystic.the-eye.eu', port=443): Max retries exceeded with url: /public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f3bc5c6bd50>: Failed to establish a new connection: [Errno 111] Connection refused'))

I changed the url to

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

and now the dataset loads successfully. I thought I’d share it here in case anyone else gets stuck there.

Thanks a lot @Teme - it seems like the Pile did indeed shift location! I’ve included your fix here: Fix URL to the Pile by lewtun · Pull Request #324 · huggingface/course · GitHub

Hi everyone!
I’m looking through the fifth chapter and just wanted to ask:
In the “Creating your own dataset” section, when looping over the pages in the fetch_issues function, is there a reason why it’s tqdm(range(num_pages)) instead of just trange(num_pages)?
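
As far as I know there is no functional difference; trange(n) is just shorthand for tqdm(range(n)), so both spellings behave the same:

from tqdm.auto import tqdm, trange

num_pages = 10  # example value

# These two loops produce identical progress bars
for page in tqdm(range(num_pages)):
    pass

for page in trange(num_pages):
    pass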

Hello, I’m getting a ValueError with the following line on Colab.

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])

“ValueError: The features can’t be aligned because the key meta of features”

Thank you for your great effort.
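
One workaround that may help (a sketch, assuming both streamed datasets have a compatible "text" column, that the mismatch is only in the nested "meta" field, and that your datasets version supports remove_columns on streaming datasets): drop "meta" from both streams before interleaving.

from datasets import interleave_datasets

# Keep only the "text" feature so the schemas of the two streams match
pubmed_text_only = pubmed_dataset_streamed.remove_columns(["meta"])
law_text_only = law_dataset_streamed.remove_columns(["meta"])

combined_dataset = interleave_datasets([pubmed_text_only, law_text_only])
next(iter(combined_dataset))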

I just faced an error after requesting the issues data from GitHub; I can’t load the data.
Any advice?


Thank you in advance

In the “Semantic search with FAISS” section, I’m using .map() to apply the get_embeddings() function to every entry in my own dataset. My dataset is very large, so I am trying to specify num_proc > 0 in map() to parallelize the process. However, the program deadlocks and stalls without processing even one data point. How should I modify get_embeddings() so it supports parallelism? Thanks!
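
Not a definitive answer, but a common cause: if get_embeddings() runs the model on a GPU, num_proc forks CPU worker processes that do not interact well with CUDA, and map() can hang. A sketch of an alternative that keeps a single process but feeds the model in batches (assuming, as in the chapter, that get_embeddings() accepts a list of texts; my_dataset and the "text" column name are placeholders for your own dataset):

# Single-process, batched mapping: the GPU does the heavy lifting, so extra
# CPU workers (num_proc) add little and can deadlock with CUDA
embeddings_dataset = my_dataset.map(
    lambda batch: {"embeddings": get_embeddings(batch["text"]).detach().cpu().numpy()},
    batched=True,
    batch_size=64,
)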

I am having trouble understanding the return_overflowing_tokens parameter of the tokenizer.
Can someone explain what overflowing tokens are and what this parameter does?
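
In case it helps others, here is a small sketch of what the parameter does: with truncation enabled, a long input is normally cut at max_length and the remainder is discarded; with return_overflowing_tokens=True the tokenizer instead returns the leftover tokens as extra chunks, plus a mapping from each chunk back to the original example.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

long_text = "the drug worked wonders for my condition " * 200  # far longer than max_length

encoding = tokenizer(
    long_text,
    max_length=128,
    truncation=True,
    return_overflowing_tokens=True,
)

# Instead of one truncated sequence we get several chunks of at most 128 tokens
print(len(encoding["input_ids"]))              # number of chunks
print(encoding["overflow_to_sample_mapping"])  # chunk -> original example index (all 0 here)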

I got it now :hugs:

I have an error executing:

issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")

I just opened a notebook, ran everything, and received the following error:

TypeError: Couldn’t cast array of type
struct<url: string, html_url: string, labels_url: string, id: int64, node_id: string, number: int64, title: string, description: string, creator: struct<login: string, id: int64, node_id: string, avatar_url: string, gravatar_id: string, url: string, html_url: string, followers_url: string, following_url: string, gists_url: string, starred_url: string, subscriptions_url: string, organizations_url: string, repos_url: string, events_url: string, received_events_url: string, type: string, site_admin: bool>, open_issues: int64, closed_issues: int64, state: string, created_at: timestamp[s], updated_at: timestamp[s], due_on: timestamp[s], closed_at: timestamp[s]>
to
null

(…)

DatasetGenerationError: An error occurred while generating the dataset

How can I solve this problem?
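
Not an official fix, but a workaround sketch that has helped with similar cast errors: the struct in the traceback looks like the GitHub milestone field, which is null for most issues, so the JSON loader first infers a null type and then fails when a real milestone appears. One option is to read the file with pandas, drop (or stringify) the offending column, and build the Dataset from the DataFrame; the column name "milestone" here is an assumption based on the fields shown in the error.

import pandas as pd
from datasets import Dataset

# Read the JSON Lines file directly with pandas
df = pd.read_json("datasets-issues.jsonl", lines=True)

# Assumption: the mixed null/struct column is "milestone"; drop it so type
# inference no longer conflicts
df = df.drop(columns=["milestone"])

issues_dataset = Dataset.from_pandas(df)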

I have the same problem.

Hello, thanks a lot for this tutorial. Is there any way to push the search engine I created to the Hugging Face Hub and then use the Inference API to make calls for similarity prediction?

Actually, I stored my embeddings in a .pickle file as you did. How can I create an inference endpoint to call it for similarity search?
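
Not a full Inference API answer, but one pattern that works (a sketch, assuming an embeddings_dataset with an "embeddings" column as in the chapter, and a hypothetical repo id): FAISS indexes are not uploaded with a dataset, so push the raw embeddings to the Hub and rebuild the index wherever the search runs. Here question_embedding stands for the query vector computed with the same get_embeddings() function.

from datasets import load_dataset

# Push the dataset with its "embeddings" column; the FAISS index itself is not uploaded
embeddings_dataset.push_to_hub("my-username/issues-embeddings")  # hypothetical repo id

# At serving time: reload, rebuild the index, then query with an embedding of the question
served = load_dataset("my-username/issues-embeddings", split="train")
served.add_faiss_index(column="embeddings")
scores, samples = served.get_nearest_examples("embeddings", question_embedding, k=5)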

I also have this problem.