Chapter 5 questions

I think there is a mistake in the ‘Creating your own dataset’ section:

issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)

It raises a TypeError because the data type of x["number"] is numpy.int64 instead of int, and the former cannot be used as an index. So

issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(int(x["number"]))}
)

will work.

Hi @DoyyingFace, thank you for reporting this error! Unfortunately, I am not able to reproduce the error using the code in the chapter’s Colab. Perhaps you missed a cell that needed executing or are using an old version of datasets? If the problem remains, can you share a copy of the notebook that you’re running?

Hi @DoyyingFace thank you for raising this error. Unfortunately I am not able to reproduce it using the Colab notebook provided with the chapter. Perhaps you are using an old version of datasets? If the problem persists, can you please share the notebook you are getting the error in?

Hi! I ran the notebook attached in the tutorial and it worked as you said (for both replies I made here). So maybe there is something wrong with my notebook, and I might check my version. Thanks for your help!
BTW, what did you mean about the old version of datasets?

Glad to hear it is working! My comment about the datasets version was that each release often contains various bug fixes, so upgrading to the latest version is often a quick way to check whether a bug is really coming from your code and not the library :slight_smile:
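A quick way to see which version is installed before upgrading is sketched below. It only uses the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is made up for this illustration and is not part of datasets.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the libraries discussed in this thread.
for name in ("datasets", "transformers"):
    print(name, installed_version(name) or "not installed")
```

If the reported version is old, `pip install --upgrade datasets` followed by a runtime restart is the usual next step.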

Hello Everyone!

I’m trying to scrape spaCy’s GitHub issues using the steps outlined in Creating your own dataset, as recommended in the :pencil2: Try it out! at the end of the tutorial (which, btw, are fantastic), when I hit the following error:

from datasets import load_dataset

issues_spacy = load_dataset('json',
                            data_files="spacy-issues.jsonl",
                            split='train')


TypeError                                 Traceback (most recent call last)
<ipython-input-10-aeab07c4b7d1> in <module>()
      3 issues_spacy = load_dataset('json',
      4                             data_files="spacy-issues.jsonl",
----> 5                             split='train')
      7 issues_spacy

14 frames
/usr/local/lib/python3.7/dist-packages/datasets/ in array_cast(array, pa_type, allow_number_to_str)
   1017             )
   1018         return array.cast(pa_type)
-> 1019     raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")

TypeError: Couldn't cast array of type
struct<url: string, html_url: string, labels_url: string, id: int64, node_id: string, number: int64, title: string, description: string, creator: struct<login: string, id: int64, node_id: string, avatar_url: string, gravatar_id: string, url: string, html_url: string, followers_url: string, following_url: string, gists_url: string, starred_url: string, subscriptions_url: string, organizations_url: string, repos_url: string, events_url: string, received_events_url: string, type: string, site_admin: bool>, open_issues: int64, closed_issues: int64, state: string, created_at: timestamp[s], updated_at: timestamp[s], due_on: null, closed_at: timestamp[s]>

I’ve already searched for what to do but am at a loss at the moment.

Any ideas?


Hey @Evan thanks for reporting this error! It looks like it might be a low-level problem with the way we parse JSON files in datasets. Would you mind uploading your dataset to the Hub and sharing it here so I can try to reproduce it on my side?


Hey @lewtun! A thousand pardons for not responding sooner.

I’m actually having a really hard time reproducing the error myself: when I set num_issues to 2_500 or 5_000, everything runs just fine. However, when I bump it up to 10_000 in the example, I get the error above :confounded:

As for loading the data to the hub, I’ve uploaded the .json file here: Evan/spaCy-github-issues · Datasets at Hugging Face



Thanks for sharing the dataset @Evan! I was able to reproduce your error, so have opened an issue on the datasets repo here: TypeError: Couldn't cast array of type for JSONLines dataset · Issue #3965 · huggingface/datasets · GitHub
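While waiting on the library issue, one possible way to investigate a cast error like this is to check whether some field changes type between records, for example null in most rows and a nested object in a few. The sketch below is an assumption about the cause, not a confirmed diagnosis: it scans JSON Lines records with the standard library and reports top-level keys that hold more than one value type. The function names and the `milestone` field in the sample are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Set


def field_types(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each top-level key to the set of Python type names seen for its values."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


def mixed_fields(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Keep only the keys whose values have more than one type across records."""
    return {key: types for key, types in field_types(lines).items() if len(types) > 1}


# Hypothetical records: the second one has a nested object where the first has null.
sample = [
    '{"number": 1, "milestone": null}',
    '{"number": 2, "milestone": {"title": "v3.0", "open_issues": 4}}',
]
print(mixed_fields(sample))

# For a real file, pass the open file handle instead of a list:
# with open("spacy-issues.jsonl") as f:
#     print(mixed_fields(f))
```

A field that is null in the first few thousand rows and a struct later would fit the observation that smaller values of num_issues load fine, though that remains a guess until checked against the data.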


Hello everyone,

I am very new to the topic, so sorry if this question is obvious.

I’d like to start working on this task (Chapter 5 - Time to slice and dice):

  1. Use the techniques from Chapter 3 to train a classifier that can predict the patient condition based on the drug review.

Since this label (patient condition) is also a string (I think there are 819 unique conditions), what would be the best approach? I was thinking about tokenizing this field and then using a seq2seq model, or maybe assigning a number to each unique condition.

Thanks for the great course!

Hey @juancopi81 what I had in mind was the second approach you describe - treat each condition as a label and try to train a multiclass classifier. Given so many labels, you might want to explore top-k accuracy as a metric, but the main goal of the exercise is to give you some practice training models in a new setting :slight_smile:
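To make the suggestion concrete, here is a minimal pure-Python sketch of the two pieces mentioned above: mapping each unique condition string to an integer label, and computing top-k accuracy from per-class scores. The function names, the toy conditions, and the scores are illustrative assumptions and do not come from the course.

```python
from typing import Dict, Sequence


def build_label_mapping(labels: Sequence[str]) -> Dict[str, int]:
    """Assign an integer id to each unique label (sorted so the ids are reproducible)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def top_k_accuracy(
    scores: Sequence[Sequence[float]], targets: Sequence[int], k: int = 5
) -> float:
    """Fraction of examples whose true label id is among the k highest-scoring classes."""
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not targets:
        return 0.0
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Toy data: four reviews, three distinct conditions.
conditions = ["Acne", "Pain", "Depression", "Pain"]
label2id = build_label_mapping(conditions)
targets = [label2id[c] for c in conditions]

# One row of model scores per review, one column per class id.
scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.1, 0.4],
    [0.6, 0.1, 0.3],
    [0.1, 0.2, 0.7],
]
print(top_k_accuracy(scores, targets, k=1))  # 0.5
print(top_k_accuracy(scores, targets, k=2))  # 0.75
```

With several hundred classes, top-1 accuracy can look discouraging even when the correct condition is usually near the top of the ranking, which is why a top-k view can be more informative.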


Hi there,

the last “Try it out!” task here asks us

  1. to “create an own dataset of GitHub issues” and
  2. to “fine-tune a multilabel classifier” (for bonus points :wink:).

I have created this dataset. It has 57 different labels and an instance may be labelled with any combination of those. I would like to add the class label names ["bug", "benchmark", "performance", ...] to the dataset. Inspired by this forum post, I have tried the following, yet without success:

features = transformers_issues_labels.features.copy()
features["arr_labels"] = ClassLabel(names=unique_labels)
transformers_issues_labels = transformers_issues_labels.map(
    lambda batch: batch, batched=False, features=features
)

TypeError: Couldn't cast array of type list<item: int64> to int64

=> Two questions:

  1. How to build a classifier for this task (e.g. “MultiLabelFromPretrainedClassifier” or something like this…)?
  2. How can I add the class label names to my dataset (specifically to the “arr_labels” features, assuming this makes sense)?

P.s. In any case: Thanks a ton to all contributors of this course. I am learning a lot and looking forward to part 3.

Hey @mdroth as a hint, you can checkout the problem_type parameter of TrainingArguments - this allows you to configure the loss for multilabel problems :slight_smile:

You might also want to check out the code associated with Chapter 9 of our book, which covers a similar topic: notebooks/09_few-to-no-labels.ipynb at main · nlp-with-transformers/notebooks · GitHub


Hi @lewtun, unfortunately, I couldn’t find a problem_type parameter in the documentation of TrainingArguments (I am using transformers.__version__ = '4.17.0'). I do not want to bomb this topic with my very specific issue, so I created a new topic here.

I am also still curious about adding the class label names to the dataset (my 2nd item).

Any help is much appreciated.
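For readers following along, a small sketch of the data side of a multilabel setup may help. As far as I know, `problem_type="multi_label_classification"` is a model configuration option in transformers (typically passed when loading the model) rather than a field of TrainingArguments, which may be why it does not appear in those docs. In that mode the model expects each example's labels as a float vector with one entry per class. The helper `multi_hot` and the three label names below are illustrative assumptions, not code from the course.

```python
from typing import List, Sequence


def multi_hot(label_ids: Sequence[int], num_labels: int) -> List[float]:
    """Turn a list of label ids into a float vector with 1.0 at each listed id."""
    vector = [0.0] * num_labels
    for idx in label_ids:
        if not 0 <= idx < num_labels:
            raise ValueError(f"label id {idx} is outside [0, {num_labels})")
        vector[idx] = 1.0
    return vector


# Hypothetical label set and one example tagged "bug" and "performance".
label_names = ["bug", "benchmark", "performance"]
example = {"arr_labels": [0, 2]}
example["labels"] = multi_hot(example["arr_labels"], len(label_names))
print(example["labels"])  # [1.0, 0.0, 1.0]
```

On the second question, the cast error is consistent with `arr_labels` holding a list of ids per example while a bare `ClassLabel` describes a single integer; wrapping it as `Sequence(ClassLabel(names=unique_labels))` is the usual way to describe a list of class labels in datasets, though I have not tested it against this particular dataset.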


When doing:

from datasets import load_dataset

data_files = ""
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)



I get the error:

When doing:


I get an empty list. Can you help me?

Apologies if this has been addressed elsewhere, but when I tried to load the dataset, I got the error below:

from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = ""
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")

ConnectionError: HTTPSConnectionPool(host=‘’, port=443): Max retries exceeded with url: /public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst (Caused by NewConnectionError(‘<urllib3.connection.HTTPSConnection object at 0x7f3bc5c6bd50>: Failed to establish a new connection: [Errno 111] Connection refused’))

I changed the url to

data_files = ""

and now the dataset loads successfully. I thought I might share it here in case anyone else got stuck there.

Thanks a lot @Teme - it seems like the Pile did indeed shift location! I’ve included your fix here: Fix URL to the Pile by lewtun · Pull Request #324 · huggingface/course · GitHub

Hi everyone!
I’m looking through the 5th chapter and just wanted to ask.
In the Creating your own dataset part, when looping over the pages in the fetch_issues function, is there a reason why it’s tqdm(range(num_pages)) instead of just trange(num_pages)?

Hello, I’m getting a ValueError with the following line on Colab.

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])

“ValueError: The features can’t be aligned because the key meta of features”
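The message suggests that the two datasets both have a `meta` column but declare different types for it. As a rough illustration of what "aligned" means here, the sketch below compares two schemas written as plain dictionaries and lists the shared keys whose types differ. The helper `misaligned_keys` and both schema dictionaries are hypothetical stand-ins for illustration, not the real features of the datasets in the chapter.

```python
from typing import Any, Dict, List


def misaligned_keys(features_a: Dict[str, Any], features_b: Dict[str, Any]) -> List[str]:
    """Return the keys present in both schemas whose declared types differ."""
    shared = features_a.keys() & features_b.keys()
    return sorted(key for key in shared if features_a[key] != features_b[key])


# Hypothetical schemas: both have "text" and "meta", but "meta" has different sub-fields.
pubmed_features = {
    "text": "string",
    "meta": {"pmid": "int64", "language": "string"},
}
law_features = {
    "text": "string",
    "meta": {"case_ID": "string", "case_jurisdiction": "string", "date_created": "string"},
}
print(misaligned_keys(pubmed_features, law_features))  # ['meta']
```

If the mismatched column is not needed downstream, dropping it from both datasets before interleaving is one option to try; whether that is available for streamed datasets depends on the installed datasets version.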