Thanks for sharing the dataset @Evan! I was able to reproduce your error, so I've opened an issue on the datasets
repo here: TypeError: Couldn't cast array of type for JSONLines dataset · Issue #3965 · huggingface/datasets · GitHub
Hello everyone,
I am very new to the topic, so sorry if this question is obvious.
I’d like to start working on this task (Chapter 5 - Time to slice and dice):
- Use the techniques from Chapter 3 to train a classifier that can predict the patient condition based on the drug review.
Since this label (patient condition) is also a string (I think there are 819 unique conditions), what would be the best approach? I was thinking about tokenizing this field and then using a seq2seq model, or maybe assigning a number to each unique condition.
Thanks for the great course!
Hey @juancopi81, what I had in mind was the second approach you describe: treat each condition as a label and try to train a multiclass classifier. Given so many labels, you might want to explore top-k accuracy as a metric, but the main goal of the exercise is to give you some practice training models in a new setting.
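As a concrete starting point, something along these lines might work (a sketch rather than the official solution; it assumes the drug-review TSV from this chapter has been downloaded locally and that the column is still called condition):
from datasets import load_dataset

# Load the training split of the drug review dataset (filename assumed from the chapter)
drug_dataset = load_dataset("csv", data_files="drugsComTrain_raw.tsv", delimiter="\t", split="train")
# Drop rows with a missing condition so every example has a label
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
# class_encode_column turns the string column into a ClassLabel feature,
# assigning an integer id to each unique condition
drug_dataset = drug_dataset.class_encode_column("condition")
num_labels = drug_dataset.features["condition"].num_classes
print(num_labels, drug_dataset.features["condition"].int2str(0))
From there you can fine-tune AutoModelForSequenceClassification with num_labels classes, much like in Chapter 3.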
Hi there,
the last “Try it out!” task here asks us
- to “create your own dataset of GitHub issues” and
- to “fine-tune a multilabel classifier” (for bonus points).
I have created this dataset. It has 57 different labels and an instance may be labelled with any combination of those. I would like to add the class label names ["bug", "benchmark", "performance", ...]
to the dataset. Inspired by this forum post, I have tried the following, yet without success:
features = transformers_issues_labels.features.copy()
features["arr_labels"] = ClassLabel(names=unique_labels)
transformers_issues_labels = transformers_issues_labels.map(
    lambda batch: batch, batched=False, features=features
)
TypeError: Couldn't cast array of type list<item: int64> to int64
=> Two questions:
- How do I build a classifier for this task (e.g. a “MultiLabelFromPretrainedClassifier” or something like that…)?
- How can I add the class label names to my dataset (specifically to the “arr_labels” feature, assuming this makes sense)?
P.S. In any case: thanks a ton to all contributors of this course. I am learning a lot and am looking forward to part 3.
Hey @mdroth, as a hint, you can check out the problem_type parameter of TrainingArguments - this allows you to configure the loss for multilabel problems.
You might also want to check out the code associated with Chapter 9 of our book, which covers a similar topic: notebooks/09_few-to-no-labels.ipynb at main · nlp-with-transformers/notebooks · GitHub
HTH!
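In case a concrete snippet helps, here is a rough sketch of the multilabel setup. One caveat: in the transformers versions I have checked, problem_type is passed to the model (or its config) rather than to TrainingArguments, and the labels are expected as multi-hot float vectors; the checkpoint and label indices below are just placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # hypothetical checkpoint
num_labels = 57  # number of distinct issue labels

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# problem_type="multi_label_classification" switches the loss to BCEWithLogitsLoss,
# the standard choice for multilabel problems
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=num_labels, problem_type="multi_label_classification"
)

batch = tokenizer(["Add a benchmark for the tokenizers"], return_tensors="pt")
labels = torch.zeros((1, num_labels))
labels[0, [3, 7]] = 1.0  # hypothetical positive label ids for this example
outputs = model(**batch, labels=labels)
print(outputs.loss)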
Hi @lewtun, unfortunately, I couldn’t find a problem_type parameter in the documentation of TrainingArguments (I am using transformers.__version__ = '4.17.0'). I do not want to bomb this topic with my very specific issue, so I created a new topic here.
I am also still curious about adding the class label names to the dataset (my 2nd item).
Any help is much appreciated.
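One idea for the second item, sketched under the assumption that arr_labels holds lists of integer label ids: since every example carries several labels, the feature needs to be a Sequence of ClassLabel rather than a bare ClassLabel, which may also be why the cast in the snippet above fails.
from datasets import ClassLabel, Sequence

# unique_labels and transformers_issues_labels are the names used in the question above
transformers_issues_labels = transformers_issues_labels.cast_column(
    "arr_labels", Sequence(ClassLabel(names=unique_labels))
)
print(transformers_issues_labels.features["arr_labels"])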
When doing:
from datasets import load_dataset
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset_streamed = load_dataset(
"json", data_files=data_files, split="train", streaming=True
)
and
next(iter(pubmed_dataset_streamed))
I get the error:
StopIteration:
When doing:
list(pubmed_dataset_streamed)
I get an empty list. Can you help me?
Apologies if this has been addressed elsewhere, but when I try to load the dataset, I get the error below:
from datasets import load_dataset
# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset
ConnectionError: HTTPSConnectionPool(host='mystic.the-eye.eu', port=443): Max retries exceeded with url: /public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f3bc5c6bd50>: Failed to establish a new connection: [Errno 111] Connection refused'))
I changed the url to
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
and now the dataset loads successfully. I thought I might share it here in case anyone else gets stuck there.
Thanks a lot @Teme - it seems like the Pile did indeed shift location! I’ve included your fix here: Fix URL to the Pile by lewtun · Pull Request #324 · huggingface/course · GitHub
Hi everyone!
I’m looking through the 5th chapter and just wanted to ask.
In the “Creating your own dataset” section, when looping over the pages in the fetch_issues function, is there a reason why it's tqdm(range(num_pages)) instead of just trange(num_pages)?
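As far as I can tell, there is no functional difference: trange(n) is simply shorthand for tqdm(range(n)), so either form should produce the same progress bar. A tiny sketch:
from tqdm.auto import tqdm, trange

# These two loops behave identically; trange is just a convenience alias
for page in tqdm(range(10)):
    pass
for page in trange(10):
    pass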
Hello, I'm getting a ValueError with the following line on Colab:
combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
“ValueError: The features can’t be aligned because the key meta of features”
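A sketch of one possible workaround, under two assumptions: that the mismatch comes from the nested meta column having different inner types in the two streamed datasets, and that your datasets version supports remove_columns on an IterableDataset. The idea is to drop the offending column before interleaving:
from datasets import interleave_datasets

pubmed_clean = pubmed_dataset_streamed.remove_columns("meta")
law_clean = law_dataset_streamed.remove_columns("meta")
# With the mismatched column gone, the remaining features should align
combined_dataset = interleave_datasets([pubmed_clean, law_clean])
next(iter(combined_dataset))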
Thank you for your great effort.
I just faced an error after requesting the issues data from GitHub, and I can't load the data.
Any advice?
Thank you in advance
In the “Semantic search with FAISS” section, I'm using .map() to apply the get_embeddings() function to every entry in my own dataset. My dataset is very large, so I am trying to specify num_proc > 0 in map() to parallelize the process. However, the program deadlocks and stalls without processing even one data point. How should I modify get_embeddings() so it supports parallelism? Thanks!
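Not an official answer, but a common pattern when the embedding model runs on a GPU is to avoid num_proc entirely (CUDA and fork-based multiprocessing tend to deadlock) and to speed things up with a batched map instead. A sketch, assuming the get_embeddings() function and the comments_dataset with a "text" column from this chapter:
def embed_batch(batch):
    # get_embeddings takes a list of texts and returns one embedding per text
    embeddings = get_embeddings(batch["text"])
    return {"embeddings": embeddings.detach().cpu().numpy()}

embeddings_dataset = comments_dataset.map(embed_batch, batched=True, batch_size=64)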
I am having trouble understanding the return_overflowing_tokens parameter of the tokenizer. Can someone explain what overflowing tokens are and what this parameter does?
I got it now
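For anyone else who lands here: when a text is longer than max_length with truncation enabled, the tokens that do not fit are normally discarded; return_overflowing_tokens=True returns them as extra chunks instead. A small sketch with a hypothetical checkpoint:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "a very long review " * 50
encoded = tokenizer(
    text, max_length=32, truncation=True, return_overflowing_tokens=True
)
# Instead of a single truncated sequence we get several chunks of at most 32 tokens,
# plus a mapping from each chunk back to the original example
print(len(encoded["input_ids"]))
print(encoded["overflow_to_sample_mapping"])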
I have an error executing:
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
I just opened a notebook, ran everything, and received the following error:
TypeError: Couldn’t cast array of type
struct<url: string, html_url: string, labels_url: string, id: int64, node_id: string, number: int64, title: string, description: string, creator: struct<login: string, id: int64, node_id: string, avatar_url: string, gravatar_id: string, url: string, html_url: string, followers_url: string, following_url: string, gists_url: string, starred_url: string, subscriptions_url: string, organizations_url: string, repos_url: string, events_url: string, received_events_url: string, type: string, site_admin: bool>, open_issues: int64, closed_issues: int64, state: string, created_at: timestamp[s], updated_at: timestamp[s], due_on: timestamp[s], closed_at: timestamp[s]>
to
null
(…)
DatasetGenerationError: An error occurred while generating the dataset
How can I solve this problem?
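One workaround that sometimes helps with these Arrow type-inference errors (no guarantee it applies to this exact file) is to load the JSON Lines file with pandas first and convert it afterwards:
import pandas as pd
from datasets import Dataset

# pandas is more forgiving about nested fields that are sometimes null,
# so the conversion may succeed where direct JSON loading fails
df = pd.read_json("datasets-issues.jsonl", lines=True)
issues_dataset = Dataset.from_pandas(df)
print(issues_dataset)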
I have the same problem.
Hello, thanks a lot for this tutorial. Is there any way to push the search engine created here to the Hugging Face Hub and then use the Inference API to make calls for similarity prediction?
Actually, I stored my embeddings in a .pickle file as you did. How can I proceed to create an inference endpoint to call this for similarity search?
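Not a full answer on the Inference API side, but one pattern that might get partway there, assuming the embeddings live in a Dataset as in this chapter: push the dataset with its embeddings column to the Hub, then rebuild the FAISS index after downloading it, since the index itself is not uploaded.
from datasets import load_dataset

# Hypothetical repo id; embeddings_dataset is assumed to already contain an "embeddings" column
embeddings_dataset.push_to_hub("my-username/issues-with-embeddings")

# Later, or in another application:
remote_dataset = load_dataset("my-username/issues-with-embeddings", split="train")
remote_dataset.add_faiss_index(column="embeddings")
# question_embedding is a NumPy vector computed with get_embeddings(), as in the chapter
scores, samples = remote_dataset.get_nearest_examples("embeddings", question_embedding, k=5)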
I also have this problem.