Chapter 5 questions

Hi @wesamkhallaf and @ValdanCS, please check out this dataset.

You can use the links listed in its Usage section as drop-in replacements for the Pile links in Section 5. That's exactly why I put that dataset together, since I ran into the same problem.

Please let me know whether it works for you, too!

Best wishes,
Matthias

I had the same error.
I don't know the details, but I can create the dataset using the pandas package as shown below.

import pandas as pd
from datasets import Dataset

# Read the JSON Lines file with pandas, then wrap the DataFrame in a Dataset
df = pd.read_json('datasets-issues.jsonl', orient='records', lines=True)
issues_dataset = Dataset.from_pandas(df, split="train")
issues_dataset
from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

---------------------------------------------------------------------------
SchemaInferenceError                      Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1948                 num_shards = shard_id + 1
-> 1949                 num_examples, num_bytes = writer.finalize()
   1950                 writer.close()

SchemaInferenceError: Please pass `features` or at least one example when writing data

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1956             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1957                 e = e.__context__
-> 1958             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1959 
   1960         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Hello, I am getting this error on Google Colab and I haven’t found any workaround for this.


Getting the same error

This URL no longer exists: https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst


My dataset is a book. Can I create a custom dataset from txt files?
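
One option, sketched below, is the plain-text loading script, which reads .txt files with one example per line; "my_book.txt" is just a placeholder for your own files:

from datasets import load_dataset

# Sketch: the "text" builder creates one example per line of each .txt file
book_dataset = load_dataset("text", data_files={"train": "my_book.txt"})
book_dataset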

Same error here. Is the data available anywhere else?

✏️ Try it out! Use the Dataset.unique() function to find the number of unique drugs and conditions in the training and test sets.

Where can I find detailed usage of the Dataset.unique() function? This is the first time I have seen it.
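
Dataset.unique() is covered in the 🤗 Datasets API reference; it returns the list of distinct values in a column. A rough sketch for the exercise (column names assume the drug-review dataset from the chapter):

# Sketch: count distinct drugs and conditions in each split
for split in ["train", "test"]:
    num_drugs = len(drug_dataset[split].unique("drugName"))
    num_conditions = len(drug_dataset[split].unique("condition"))
    print(split, num_drugs, num_conditions)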

**pubmed_dataset_streamed and law_dataset_streamed cannot be combined with interleave_datasets()**

from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

ValueError: The features can't be aligned because the key meta of features {'meta': {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)}, 'text': Value(dtype='string', id=None)} has unexpected type - {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)} (expected either {'pmid': Value(dtype='int64', id=None), 'language': Value(dtype='string', id=None)} or Value("null")).
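
One possible workaround, sketched below, is to drop the mismatched nested meta column from both streamed datasets before interleaving, so that only the shared text column has to be aligned:

from itertools import islice
from datasets import interleave_datasets

# Sketch: remove the "meta" column, whose nested schema differs between the two corpora
pubmed_text_only = pubmed_dataset_streamed.remove_columns("meta")
law_text_only = law_dataset_streamed.remove_columns("meta")

combined_dataset = interleave_datasets([pubmed_text_only, law_text_only])
list(islice(combined_dataset, 2))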

**TypeError: Couldn't cast array of type timestamp[s] to null**

from datasets import load_dataset

issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
issues_dataset
Downloading data files: 100% 1/1 [00:00<00:00, 37.80it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 44.29it/s]
Generating train split: 2584/0 [00:01<00:00, 3104.08 examples/s]

TypeError: Couldn’t cast array of type timestamp[s] to null

The above exception was the direct cause of the following exception:

DatasetGenerationError: An error occurred while generating the dataset

I divided datasets-issues.jsonl into two files and found that each file loads correctly:

from datasets import load_dataset

issues_dataset_1 = load_dataset("json", data_files="datasets-issues-1.jsonl", split="train")

issues_dataset_1

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason'],
    num_rows: 2884
})

from datasets import load_dataset

issues_dataset_2 = load_dataset("json", data_files="datasets-issues-2.jsonl", split="train")

issues_dataset_2

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason'],
    num_rows: 3624
})

I tried to combine issues_dataset_1 and issues_dataset_2 into a single issues_dataset, but did not succeed.
I decided to use issues_dataset_2 as issues_dataset, since I had already wasted too much time on this.
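
For reference, a sketch of how the two halves might still be combined: concatenate_datasets requires identical features, so one option is to cast one half's features onto the other first (the cast direction may need swapping depending on which file inferred the richer types):

from datasets import concatenate_datasets

# Sketch: align the schemas by casting, then concatenate the two halves
issues_dataset_2 = issues_dataset_2.cast(issues_dataset_1.features)
issues_dataset = concatenate_datasets([issues_dataset_1, issues_dataset_2])
issues_dataset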

Use this link instead:

data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"


In the “Big Data” section of this chapter, the sample notebook tries to retrieve PubMed data from a URL that is no longer working. Is there another URL that we can use? If not, is there a similar dataset that we can retrieve from “the-eye.eu” domain?

Thank you!

In the section about embeddings, the model multi-qa-mpnet-base-dot-v1 is referenced. However, the documentation says "Suitable models for asymmetric semantic search: Pre-Trained MS MARCO Models".

Why did we not use the best model from this family?

In “Time to Slice and Dice,” I cannot get the parallel dataset mapping to work:

drug_dataset.map(tokenize_function, batched=True)

runs without issue, but:

drug_dataset.map(tokenize_function, batched=True, num_proc=4) 

causes a crash with the message:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\utils\py_utils.py", line 1377, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\arrow_dataset.py", line 3466, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\arrow_dataset.py", line 3345, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "C:\Users\964864\AppData\Local\Temp\1\ipykernel_18204\2403223452.py", line 6, in tokenize_function
NameError: name 'tokenizer' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
File <timed exec>:1

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\dataset_dict.py:855, in DatasetDict.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, desc)
    852 if cache_file_names is None:
    853     cache_file_names = {k: None for k in self}
    854 return DatasetDict(
--> 855     {
    856         k: dataset.map(
    857             function=function,
    858             with_indices=with_indices,
    859             with_rank=with_rank,
    860             input_columns=input_columns,
    861             batched=batched,
    862             batch_size=batch_size,
    863             drop_last_batch=drop_last_batch,
    864             remove_columns=remove_columns,
    865             keep_in_memory=keep_in_memory,
    866             load_from_cache_file=load_from_cache_file,
    867             cache_file_name=cache_file_names[k],
    868             writer_batch_size=writer_batch_size,
    869             features=features,
    870             disable_nullable=disable_nullable,
    871             fn_kwargs=fn_kwargs,
    872             num_proc=num_proc,
    873             desc=desc,
    874         )
    875         for k, dataset in self.items()
    876     }
    877 )

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\dataset_dict.py:856, in <dictcomp>(.0)
    852 if cache_file_names is None:
    853     cache_file_names = {k: None for k in self}
    854 return DatasetDict(
    855     {
--> 856         k: dataset.map(
    857             function=function,
    858             with_indices=with_indices,
    859             with_rank=with_rank,
    860             input_columns=input_columns,
    861             batched=batched,
    862             batch_size=batch_size,
    863             drop_last_batch=drop_last_batch,
    864             remove_columns=remove_columns,
    865             keep_in_memory=keep_in_memory,
    866             load_from_cache_file=load_from_cache_file,
    867             cache_file_name=cache_file_names[k],
    868             writer_batch_size=writer_batch_size,
    869             features=features,
    870             disable_nullable=disable_nullable,
    871             fn_kwargs=fn_kwargs,
    872             num_proc=num_proc,
    873             desc=desc,
    874         )
    875         for k, dataset in self.items()
    876     }
    877 )

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\arrow_dataset.py:591, in transmit_tasks.<locals>.wrapper(*args, **kwargs)
    589     self: "Dataset" = kwargs.pop("self")
    590 # apply actual function
--> 591 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    592 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    593 for dataset in datasets:
    594     # Remove task templates if a column mapping of the template is no longer valid

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\arrow_dataset.py:556, in transmit_format.<locals>.wrapper(*args, **kwargs)
    549 self_format = {
    550     "type": self._format_type,
    551     "format_kwargs": self._format_kwargs,
    552     "columns": self._format_columns,
    553     "output_all_columns": self._output_all_columns,
    554 }
    555 # apply actual function
--> 556 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    557 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    558 # re-apply format to the output

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\arrow_dataset.py:3181, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3174 logger.info(f"Spawning {num_proc} processes")
   3175 with logging.tqdm(
   3176     disable=not logging.is_progress_bar_enabled(),
   3177     unit=" examples",
   3178     total=pbar_total,
   3179     desc=(desc or "Map") + f" (num_proc={num_proc})",
   3180 ) as pbar:
-> 3181     for rank, done, content in iflatmap_unordered(
   3182         pool, Dataset._map_single, kwargs_iterable=kwargs_per_job
   3183     ):
   3184         if done:
   3185             shards_done += 1

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\utils\py_utils.py:1417, in iflatmap_unordered(pool, func, kwargs_iterable)
   1414 finally:
   1415     if not pool_changed:
   1416         # we get the result in case there's an error to raise
-> 1417         [async_result.get(timeout=0.05) for async_result in async_results]

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\datasets\utils\py_utils.py:1417, in <listcomp>(.0)
   1414 finally:
   1415     if not pool_changed:
   1416         # we get the result in case there's an error to raise
-> 1417         [async_result.get(timeout=0.05) for async_result in async_results]

File c:\Users\964864\OneDrive - Cognizant HealthCare\Documents\Innovation Project\HuggingFace NLP Tutorial\notebooks\.venv\lib\site-packages\multiprocess\pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

NameError: name 'tokenizer' is not defined

What is going on? A basic Google search suggested I might need to set the number of threads in torch, but neither torch.set_num_threads(4) nor torch.set_num_threads(1) fixed the issue.
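
This is only a guess, but the NameError suggests the worker processes spawned on Windows do not see the notebook-level tokenizer global. A sketch of one workaround is to pass the tokenizer to the mapped function explicitly via fn_kwargs (the checkpoint and column name follow the chapter; adjust them to your setup):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples, tokenizer):
    # the tokenizer now arrives as an argument, so each worker process has it
    return tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    fn_kwargs={"tokenizer": tokenizer},
)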

Hello,
Thanks for the great explanations. I am doing all of the NLP courses.
In Chapter 5, "Time to slice and dice", when I load the medical record TSVs with

drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

I get these errors:

TypeError: read_csv() got an unexpected keyword argument 'mangle_dupe_cols'

The above exception was the direct cause of the following exception:

DatasetGenerationError: An error occurred while generating the dataset

I have been using Hugging Face and pandas elsewhere without issues. I also tried downloading the datasets manually and unzipping them before loading, but got the same error. I do not understand what mangle_dupe_cols is.

FYI, my datasets version is 2.10.0 and my pandas version is 2.1.1. I found a similar post: [BUG] With Pandas 2.0.0, `load_dataset` raises `TypeError: read_csv() got an unexpected keyword argument 'mangle_dupe_cols'` · Issue #5744 · huggingface/datasets · GitHub, but that was for earlier versions.
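
The mangle_dupe_cols argument was removed from pandas.read_csv in pandas 2.0, so this looks like a version mismatch between datasets 2.10.0 and pandas 2.1.1; upgrading datasets may resolve it. As a stopgap, here is a sketch that bypasses the CSV builder by reading the TSVs with pandas directly (file names as in the chapter):

import pandas as pd
from datasets import Dataset, DatasetDict

# Sketch: read the raw TSVs with pandas, then wrap them in a DatasetDict
data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = DatasetDict(
    {split: Dataset.from_pandas(pd.read_csv(path, sep="\t")) for split, path in data_files.items()}
)
drug_dataset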

The following link is not working: https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst

I also tried the alternative provided in this discussion, but it is not working either: https://mystic.the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst


The model used in the “Semantic search with FAISS” section, multi-qa-mpnet-base-dot-v1, has been tuned for dot-product similarity, whereas FAISS uses cosine similarity. Is this an issue? If so, how should we calculate the dot-product similarity instead?
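
For what it's worth, FAISS supports inner-product indexes as well as L2, and Dataset.add_faiss_index exposes a metric_type argument. A sketch, assuming the embeddings column from the chapter:

import faiss

# Sketch: build an inner-product (dot-product) FAISS index instead of the default
embeddings_dataset.add_faiss_index(
    column="embeddings", metric_type=faiss.METRIC_INNER_PRODUCT
)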

I have been experimenting with the semantic search with FAISS method from Chapter 5. I divided a document into sentences and used semantic search to answer questions. Often the answers were very good but not up to human standards. I then went through the document and made sure each sentence was self-contained and did not require context to understand, for example by replacing pronouns with the proper nouns. The results were much better: 100% of my test questions were answered accurately. So my suggestion is to prepare documents used for semantic search in this way, if at all practical.

Try it out! Compute the average rating per drug and store the result in a new Dataset.

How should I approach this? Can you give any guidance? Thank you.

Try using pandas methods to handle it; I recommend groupby().
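
A sketch of one approach, switching to pandas format as the chapter does and grouping by drug (column names assume the drug-review dataset):

from datasets import Dataset

# Sketch: compute the mean rating per drug with pandas, then build a new Dataset
drug_dataset.set_format("pandas")
train_df = drug_dataset["train"][:]

avg_rating = train_df.groupby("drugName")["rating"].mean().reset_index()
avg_rating_dataset = Dataset.from_pandas(avg_rating)

drug_dataset.reset_format()
avg_rating_dataset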