Chapter 5 questions

Thanks for sharing the dataset @Evan! I was able to reproduce your error, so I’ve opened an issue on the datasets repo here: TypeError: Couldn't cast array of type for JSONLines dataset · Issue #3965 · huggingface/datasets · GitHub

Hello everyone,

I am very new to the topic, so sorry if this question is obvious.

I’d like to start working on this task (Chapter 5 - Time to slice and dice):

  1. Use the techniques from Chapter 3 to train a classifier that can predict the patient condition based on the drug review.

Since this label (patient condition) is also a string (I think there are 819 unique conditions), what would be the best approach? I was thinking about tokenizing this field and then using a seq2seq model, or maybe assigning a number to each unique condition.

Thanks for the great course!

Hey @juancopi81, what I had in mind was the second approach you describe: treat each condition as a label and train a multiclass classifier. Given that there are so many labels, you might want to explore top-k accuracy as a metric, but the main goal of the exercise is to give you some practice training models in a new setting :slight_smile:
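
For concreteness, here is a minimal sketch of that approach, assuming the data file and column names from the chapter (review text in "review", target in "condition"); it only covers the data preparation, the Trainer setup is then the same as in Chapter 3:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: the TSV file and column names match the ones used in the chapter
drug_dataset = load_dataset(
    "csv", data_files="drugsComTrain_raw.tsv", delimiter="\t", split="train"
)
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

# Turn the string column into a ClassLabel so each condition gets an integer id
drug_dataset = drug_dataset.class_encode_column("condition")
num_labels = drug_dataset.features["condition"].num_classes

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
)

def tokenize(batch):
    return tokenizer(batch["review"], truncation=True)

encoded = drug_dataset.map(tokenize, batched=True).rename_column("condition", "labels")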

Hi there,

the last “Try it out!” task here asks us

  1. to “create an own dataset of GitHub issues” and
  2. to “fine-tune a multilabel classifier” (for bonus points :wink:).

I have created this dataset. It has 57 different labels and an instance may be labelled with any combination of those. I would like to add the class label names ["bug", "benchmark", "performance", ...] to the dataset. Inspired by this forum post, I have tried the following, yet without success:

from datasets import ClassLabel

# Copy the existing features and declare "arr_labels" as a ClassLabel
features = transformers_issues_labels.features.copy()
features["arr_labels"] = ClassLabel(names=unique_labels)

# Re-map the dataset with the new features to apply the cast
transformers_issues_labels = transformers_issues_labels.map(
    lambda batch: batch, batched=False, features=features
)

TypeError: Couldn't cast array of type list<item: int64> to int64

=> Two questions:

  1. How do I build a classifier for this task (e.g. a "MultiLabelFromPretrainedClassifier" or something like this…)?
  2. How can I add the class label names to my dataset (specifically to the "arr_labels" feature, assuming this makes sense)?

P.s. In any case: Thanks a ton to all contributors of this course. I am learning a lot and looking forward to part 3.

Hey @mdroth, as a hint, you can check out the problem_type parameter of TrainingArguments - this allows you to configure the loss for multilabel problems :slight_smile:

You might also want to check out the code associated with Chapter 9 of our book, which covers a similar topic: notebooks/09_few-to-no-labels.ipynb at main · nlp-with-transformers/notebooks · GitHub

HTH!
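
One note that may save some searching: in the transformers versions I have used, problem_type is a model/config parameter rather than a TrainingArguments field. A minimal sketch of the multi-label setup (57 labels, matching the dataset described above):

from transformers import AutoModelForSequenceClassification

# With problem_type="multi_label_classification" the model uses BCEWithLogitsLoss,
# so the labels should be float vectors of shape (num_labels,) with 0.0/1.0 entries
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=57,
    problem_type="multi_label_classification",
)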

Hi @lewtun, unfortunately, I couldn’t find a problem_type parameter in the documentation of TrainingArguments (I am using transformers.__version__ = '4.17.0'). I do not want to bomb this topic with my very specific issue, so I created a new topic here.

I am also still curious about adding the class label names to the dataset (my 2nd item).

Any help is much appreciated.
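
Regarding the class label names, one approach that might work (a sketch, assuming arr_labels holds lists of integer label ids and unique_labels is your list of 57 label names): wrap the ClassLabel in a Sequence feature, since each example carries a list of labels rather than a single one.

from datasets import ClassLabel, Sequence

# A bare ClassLabel expects one integer per example, which is why the earlier map()
# call failed with "Couldn't cast array of type list<item: int64> to int64".
# Sequence(ClassLabel(...)) declares a variable-length list of class ids instead.
transformers_issues_labels = transformers_issues_labels.cast_column(
    "arr_labels", Sequence(ClassLabel(names=unique_labels))
)

# The label names are now attached to the feature
print(transformers_issues_labels.features["arr_labels"].feature.names[:5])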

Solved.

When doing:

from datasets import load_dataset

data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

and

next(iter(pubmed_dataset_streamed))

I get the error:
StopIteration:

When doing:

list(pubmed_dataset_streamed)

I get an empty list. Can you help me?

Apologies if this has been addressed elsewhere, but when I try to load the dataset, I get the error below:

from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

ConnectionError: HTTPSConnectionPool(host='mystic.the-eye.eu', port=443): Max retries exceeded with url: /public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f3bc5c6bd50>: Failed to establish a new connection: [Errno 111] Connection refused'))

I changed the url to

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

and now the dataset loads successfully. I thought I’d share it here in case anyone else gets stuck there.

Thanks a lot @Teme - it seems like the Pile did indeed shift location! I’ve included your fix here: Fix URL to the Pile by lewtun · Pull Request #324 · huggingface/course · GitHub

Hi everyone!
I’m looking through the fifth chapter and just wanted to ask:
In the “Creating your own dataset” section, when looping over the pages in the fetch_issues function, is there a reason why it’s tqdm(range(num_pages)) instead of just trange(num_pages)?
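
As far as I know there is no functional difference; trange(n) is just shorthand for tqdm(range(n)), so both spellings behave the same:

from tqdm.auto import tqdm, trange

num_pages = 10  # example value

# These two loops produce identical progress bars
for page in tqdm(range(num_pages)):
    pass

for page in trange(num_pages):
    pass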

Hello, I’m getting a ValueError with the following line on Colab.

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])

“ValueError: The features can’t be aligned because the key meta of features”

Thank you for your great effort.
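
One workaround that may help (a sketch, assuming both streamed datasets have a compatible "text" column, that the mismatch is only in the nested "meta" field, and that your datasets version supports remove_columns on streaming datasets): drop "meta" from both streams before interleaving.

from datasets import interleave_datasets

# Keep only the "text" feature so the schemas of the two streams match
pubmed_text_only = pubmed_dataset_streamed.remove_columns(["meta"])
law_text_only = law_dataset_streamed.remove_columns(["meta"])

combined_dataset = interleave_datasets([pubmed_text_only, law_text_only])
next(iter(combined_dataset))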

I just faced an error after requesting the issues data from GitHub; I can’t load the data.
Any advice?


Thank you in advance

In the “Semantic search with FAISS” section, I’m using .map() to apply the get_embeddings() function to every entry in my own dataset. My dataset is very large, so I am trying to specify num_proc > 0 in map() to parallelize the process. However, the program deadlocks and stalls without processing even one data point. How should I modify get_embeddings() so it supports parallelism? Thanks!
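
Not a definitive answer, but a common cause: if get_embeddings() runs the model on a GPU, num_proc forks CPU worker processes that do not interact well with CUDA, and map() can hang. A sketch of an alternative that keeps a single process but feeds the model in batches (assuming, as in the chapter, that get_embeddings() accepts a list of texts; my_dataset and the "text" column name are placeholders for your own dataset):

# Single-process, batched mapping: the GPU does the heavy lifting, so extra
# CPU workers (num_proc) add little and can deadlock with CUDA
embeddings_dataset = my_dataset.map(
    lambda batch: {"embeddings": get_embeddings(batch["text"]).detach().cpu().numpy()},
    batched=True,
    batch_size=64,
)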

I am having trouble understanding the return_overflowing_tokens parameter of the tokenizer.
Can someone explain what overflowing tokens are and what this parameter does?
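
In case it helps others, here is a small sketch of what the parameter does: with truncation enabled, a long input is normally cut at max_length and the remainder is discarded; with return_overflowing_tokens=True the tokenizer instead returns the leftover tokens as extra chunks, plus a mapping from each chunk back to the original example.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

long_text = "the drug worked wonders for my condition " * 200  # far longer than max_length

encoding = tokenizer(
    long_text,
    max_length=128,
    truncation=True,
    return_overflowing_tokens=True,
)

# Instead of one truncated sequence we get several chunks of at most 128 tokens
print(len(encoding["input_ids"]))              # number of chunks
print(encoding["overflow_to_sample_mapping"])  # chunk -> original example index (all 0 here)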

I got it now :hugs:

I have an error executing:

issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")

I just opened a notebook, ran everything, and received the following error:

TypeError: Couldn’t cast array of type
struct<url: string, html_url: string, labels_url: string, id: int64, node_id: string, number: int64, title: string, description: string, creator: struct<login: string, id: int64, node_id: string, avatar_url: string, gravatar_id: string, url: string, html_url: string, followers_url: string, following_url: string, gists_url: string, starred_url: string, subscriptions_url: string, organizations_url: string, repos_url: string, events_url: string, received_events_url: string, type: string, site_admin: bool>, open_issues: int64, closed_issues: int64, state: string, created_at: timestamp[s], updated_at: timestamp[s], due_on: timestamp[s], closed_at: timestamp[s]>
to
null

(…)

DatasetGenerationError: An error occurred while generating the dataset

How can I solve this problem?
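
Not an official fix, but a workaround sketch that has helped with similar cast errors: the struct in the traceback looks like the GitHub milestone field, which is null for most issues, so the JSON loader first infers a null type and then fails when a real milestone appears. One option is to read the file with pandas, drop (or stringify) the offending column, and build the Dataset from the DataFrame; the column name "milestone" here is an assumption based on the fields shown in the error.

import pandas as pd
from datasets import Dataset

# Read the JSON Lines file directly with pandas
df = pd.read_json("datasets-issues.jsonl", lines=True)

# Assumption: the mixed null/struct column is "milestone"; drop it so type
# inference no longer conflicts
df = df.drop(columns=["milestone"])

issues_dataset = Dataset.from_pandas(df)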

I have the same problem.

Hello, thanks a lot for this tutorial. Is there any way to push the search engine I created to the Hugging Face Hub and then use the Inference API to make calls for similarity prediction?

Actually, I stored my embeddings in a .pickle file as you did. How can I create an inference endpoint to call it for similarity search?
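
Not a full Inference API answer, but one pattern that works (a sketch, assuming an embeddings_dataset with an "embeddings" column as in the chapter, and a hypothetical repo id): FAISS indexes are not uploaded with a dataset, so push the raw embeddings to the Hub and rebuild the index wherever the search runs. Here question_embedding stands for the query vector computed with the same get_embeddings() function.

from datasets import load_dataset

# Push the dataset with its "embeddings" column; the FAISS index itself is not uploaded
embeddings_dataset.push_to_hub("my-username/issues-embeddings")  # hypothetical repo id

# At serving time: reload, rebuild the index, then query with an embedding of the question
served = load_dataset("my-username/issues-embeddings", split="train")
served.add_faiss_index(column="embeddings")
scores, samples = served.get_nearest_examples("embeddings", question_embedding, k=5)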

I also have this problem.