Chapter 5 questions

Thanks for sharing the dataset @Evan! I was able to reproduce your error, so I've opened an issue on the datasets repo here: TypeError: Couldn't cast array of type for JSONLines dataset · Issue #3965 · huggingface/datasets · GitHub


Hello everyone,

I am very new to the topic, so sorry if this question is obvious.

I’d like to start working on this task (Chapter 5 - Time to slice and dice):

  1. Use the techniques from Chapter 3 to train a classifier that can predict the patient condition based on the drug review.

Since this label (patient condition) is also a string (I think there are 819 unique conditions), what would be the best approach? I was thinking about tokenizing this field and then using a seq2seq model, or maybe assigning a number to each unique condition.

Thanks for the great course!

Hey @juancopi81, what I had in mind was the second approach you describe: treat each condition as a label and try to train a multiclass classifier. Given so many labels, you might want to explore top-k accuracy as a metric, but the main goal of the exercise is to give you some practice training models in a new setting :slight_smile:
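For reference, here is a minimal sketch of that approach, assuming the drug review TSVs from the chapter and the "condition" column as the target (file and column names are illustrative, adjust them to your setup):

from datasets import load_dataset
from sklearn.metrics import top_k_accuracy_score

# Illustrative file names from the chapter; point these at wherever your TSVs live
data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

# Drop rows without a condition, then turn the string column into integer ids;
# class_encode_column also stores the label names in the dataset's features
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
drug_dataset = drug_dataset.class_encode_column("condition")
num_labels = drug_dataset["train"].features["condition"].num_classes

# After fine-tuning a classifier with num_labels outputs (as in Chapter 3),
# top-k accuracy can be computed from the predicted scores, for example:
# top_k_accuracy_score(y_true, y_scores, k=5, labels=range(num_labels))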


Hi there,

the last “Try it out!” task here asks us

  1. to “create your own dataset of GitHub issues” and
  2. to “fine-tune a multilabel classifier” (for bonus points :wink:).

I have created this dataset. It has 57 different labels and an instance may be labelled with any combination of those. I would like to add the class label names ["bug", "benchmark", "performance", ...] to the dataset. Inspired by this forum post, I have tried the following, but without success:

from datasets import ClassLabel

features = transformers_issues_labels.features.copy()
features["arr_labels"] = ClassLabel(names=unique_labels)
transformers_issues_labels = transformers_issues_labels.map(
    lambda batch: batch, batched=False, features=features
)

TypeError: Couldn't cast array of type list<item: int64> to int64

=> Two questions:

  1. How do I build a classifier for this task (e.g., a “MultiLabelFromPretrainedClassifier” or something like this…)?
  2. How can I add the class label names to my dataset (specifically to the “arr_labels” feature, assuming this makes sense)?

P.S. In any case: thanks a ton to all contributors of this course. I am learning a lot and looking forward to part 3.

Hey @mdroth, as a hint, you can check out the problem_type parameter of TrainingArguments, which allows you to configure the loss for multilabel problems :slight_smile:
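A minimal sketch of what that can look like (hedged: depending on the transformers version, problem_type may need to be passed to the model via from_pretrained / the config rather than to TrainingArguments):

from transformers import AutoModelForSequenceClassification

# "bert-base-uncased" and num_labels=57 are placeholders for your own setup.
# With problem_type="multi_label_classification" the model uses BCEWithLogitsLoss,
# so the labels should be multi-hot float vectors of length num_labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=57,
    problem_type="multi_label_classification",
)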

You might also want to check out the code associated with Chapter 9 of our book, which covers a similar topic: notebooks/09_few-to-no-labels.ipynb at main · nlp-with-transformers/notebooks · GitHub

HTH!
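For the second question (attaching the label names to arr_labels), one thing worth trying, just as a sketch and assuming arr_labels holds a list of integer ids per example, is to cast the column to a Sequence of ClassLabel features instead of a single ClassLabel:

from datasets import ClassLabel, Sequence

# A plain ClassLabel expects one integer per example, which is why the cast above
# fails on list<item: int64>; wrapping it in Sequence casts each element of the list.
transformers_issues_labels = transformers_issues_labels.cast_column(
    "arr_labels", Sequence(ClassLabel(names=unique_labels))
)
print(transformers_issues_labels.features["arr_labels"])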

Hi @lewtun, unfortunately I couldn’t find a problem_type parameter in the documentation of TrainingArguments (I am using transformers.__version__ = '4.17.0'). I do not want to bomb this topic with my very specific issue, so I created a new topic here.

I am also still curious about adding the class label names to the dataset (my 2nd item).

Any help is much appreciated.

Solved.

When doing:

from datasets import load_dataset

data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

and

next(iter(pubmed_dataset_streamed))

I get the following error:

StopIteration

When doing:

list(pubmed_dataset_streamed)

I get an empty list. Can you help me?
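In case it helps with narrowing this down: a StopIteration from next(iter(...)) just means the streamed dataset produced no examples, which can happen if the remote file cannot actually be fetched. A quick sanity check (hedged, this only tests reachability, not the dataset itself) is to see whether the URL still responds:

import requests

# A non-200 status (or a timeout / connection error) here would explain why the
# streamed dataset comes back empty.
response = requests.head(data_files, allow_redirects=True, timeout=10)
print(response.status_code, response.headers.get("Content-Length"))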