Chapter 5 questions

I have the same problem.
At the same time,
df = pd.read_json(data_files, lines=True)
works fine to load the same file.

I have the same problem also.

@Mentatko
This statement creates a pandas DataFrame, but load_dataset creates an Apache Arrow dataset. They have different attributes and methods.
For example, you cannot execute the next statement from the same chapter on it.

sample = issues_dataset.shuffle(seed=666).select(range(3))

I solved the problem with the help of ChatGPT :grinning:
The problem occurs because there are some NULL timestamps and

load_dataset()

cannot handle them. We need to handle the issue manually.
You can replace the part below:

all_issues.extend(batch)

# Replace missing timestamp values with default value
for issue in all_issues:
    if "created_at" in issue and issue["created_at"] is None:
        issue["created_at"] = "1970-01-01T00:00:00Z"
    if "updated_at" in issue and issue["updated_at"] is None:
        issue["updated_at"] = "1970-01-01T00:00:00Z"

df = pd.DataFrame.from_records(all_issues)

Yes, true.
At the same time, after

df = pd.read_json(data_files, lines=True)

I used

from datasets import Dataset
tds = Dataset.from_pandas(df)

so I have the same data format.

By the way, I made my JSON work by removing the fields
'author_association', 'timeline_url', 'reactions', 'performed_via_github_app'.
I don't know why, but the datetime fields 'created_at', 'updated_at', 'closed_at' did not trigger an error.
I did:

import pandas as pd
from datasets import Dataset, load_dataset

df = pd.read_json(data_files, lines=True)
tds = Dataset.from_pandas(df)
dataset = tds.remove_columns(
    ["author_association", "timeline_url", "reactions", "performed_via_github_app"]
)
data_file_name = "datasets-issues-wo-cols.jsonl"
new_data_file = PROJECT_DIR + data_file_name  # PROJECT_DIR is defined earlier in my notebook
dataset.to_json(new_data_file, orient="records", lines=True)
issues_dataset = load_dataset("json", data_files=new_data_file)
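For anyone whose error does come from the datetime columns, a rough, untested sketch of an alternative to dropping them would be to cast them to strings before writing the JSONL:

import pandas as pd
from datasets import Dataset

# data_files is the same JSONL path used above
df = pd.read_json(data_files, lines=True)
for col in ["created_at", "updated_at", "closed_at"]:
    if col in df.columns:
        # datetime64 values become ISO-like strings; NaT becomes the string "NaT"
        df[col] = df[col].astype(str)

tds = Dataset.from_pandas(df)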


@Mentatko
It is very strange that my solution does not work now. :sweat_smile:
My issue count was 2,000 before. Changing the timestamps worked then, but it did not work with 10,000 issues.
I had to remove the columns 'created_at', 'updated_at', 'closed_at' along with the fields you mentioned.

Thanks for the solution.


Update:
Add

, split="train"

to the parameters of the last line. That is, instead of:

issues_dataset = load_dataset("json", data_files=new_data_file)

use

issues_dataset = load_dataset("json", data_files=new_data_file, split="train")

so that the next line does not raise an error:

sample = issues_dataset.shuffle(seed=666).select(range(3))
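Alternatively, you can keep the default DatasetDict and index its "train" split afterwards; shuffle() and select() then work either way, because they are called on a Dataset rather than on a DatasetDict:

from datasets import load_dataset

# Equivalent alternative: index the "train" split of the returned DatasetDict
issues_dataset = load_dataset("json", data_files=new_data_file)["train"]
sample = issues_dataset.shuffle(seed=666).select(range(3))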

Hi, first I would like to thank you for this amazing course. It's been really helpful. I have a question I couldn't find the answer to anywhere. In Chapter 3, in "A full training", you use the PyTorch DataLoader. From what I understand, this doesn't use the memory-mapping of the data on disk. My question is: what would the final training look like without using the DataLoader, keeping the data in the Dataset object so as to take advantage of the memory-mapping? How can I feed the data to the model in batches without the DataLoader?
I'm new to the Transformers library.
Thanks in advance.
Nico
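P.S. This rough sketch is what I have in mind (variable names borrowed from Chapter 3; I'm assuming the extra text columns were already removed, the label column renamed, and padding="max_length" used at tokenization so the slices stack into tensors):

# Slicing the Arrow-backed Dataset directly only materialises one batch at a time
tokenized_datasets.set_format("torch")
train_dataset = tokenized_datasets["train"]

batch_size = 8
for start in range(0, len(train_dataset), batch_size):
    batch = train_dataset[start : start + batch_size]  # dict of stacked tensors
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()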

Thanks for the wonderful course - quick question about the batch mapping code:

new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

I seem to be able to run the code like this as well

drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])}, batched=True)

I understand the principle that a batch of 1,000 items is being operated on by default, but is the list comprehension actually necessary?
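To make my question concrete, here is a toy sketch (toy data, not the drug dataset) of what I understand each version receives:

import html
from datasets import Dataset

toy = Dataset.from_dict({"review": ["A &amp; B", "C &lt; D"]})

# batched=True: x["review"] is a list of strings, so each element is unescaped
batched = toy.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

# default batched=False: x["review"] is a single string
unbatched = toy.map(lambda x: {"review": html.unescape(x["review"])})

print(batched["review"], unbatched["review"])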

[Possible error in the course]

I think I found an error in your course. I'm not sure how to report it.
What is wrong, in my opinion:
In Semantic search with FAISS - Hugging Face Course,
when you sort the best matches:

samples_df.sort_values("scores", ascending=False, inplace=True)

I believe it should be ascending=True.

When you change the k in:

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

to k=1, you will get the lowest score as the best match, not the highest.
Please let me know if my thinking is correct.

Hello

I am trying to complete the "Creating your own dataset" tutorial, but I cannot load the dataset after I pull issues from GitHub.

Here is the error I get:

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-2cf51354fbcf9df8/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%
1/1 [00:00<00:00, 57.27it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.86it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1872                         )
-> 1873                     writer.write_table(table)
   1874                     num_examples_progress_update += len(table)

17 frames
TypeError: Couldn't cast array of type timestamp[s] to null

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1889             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1890                 e = e.__context__
-> 1891             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1892 
   1893         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

this is my copy of the notebook:

Thanks in advance!


I have the same error on Colab. Have you fixed it? The error looks like this:

ValueError Traceback (most recent call last)
in <cell line: 4>()
2 from datasets import interleave_datasets
3
----> 4 combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
5 list(islice(combined_dataset, 2))

2 frames
/usr/local/lib/python3.10/dist-packages/datasets/features/features.py in _check_if_features_can_be_aligned(features_list)
   2092     for k, v in features.items():
   2093         if not (isinstance(v, Value) and v.dtype == "null") and name2feature[k] != v:
-> 2094             raise ValueError(
   2095                 f'The features can\'t be aligned because the key {k} of features {features} has unexpected type - {v} (expected either {name2feature[k]} or Value("null").'
   2096             )

ValueError: The features can't be aligned because the key meta of features {'meta': {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)}, 'text': Value(dtype='string', id=None)} has unexpected type - {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)} (expected either {'language': Value(dtype='string', id=None), 'pmid': Value(dtype='int64', id=None)} or Value("null").
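I wonder whether dropping the mismatched meta column before interleaving would be an acceptable workaround; something like this untested sketch:

from itertools import islice
from datasets import interleave_datasets

# Untested sketch: the error says the "meta" features of the two streamed
# datasets differ, so drop that column from both before interleaving
combined_dataset = interleave_datasets(
    [
        pubmed_dataset_streamed.remove_columns("meta"),
        law_dataset_streamed.remove_columns("meta"),
    ]
)
list(islice(combined_dataset, 2))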


Hi michalreal, I’m so glad that someone else has picked up on this - thought I was going mad!

The issue that I’ve seen (and that I think you are referring to also) is easily visible by updating the concatenate_text function to only search the title e.g.

def concatenate_text(examples):
    return {
        "text": examples["title"]
    }

And also changing the question to a single-term search, e.g. "failing".

It’s then very obvious that the best hit always comes back with the lowest score.

Unfortunately, I can't confirm whether it's an issue or just a "feature" of how the scoring works (i.e. lower score is better), but it seems pretty odd.

Did you (or anyone) get to the bottom of this?

Andy

I have the same issue - Google Colab notebook.

Hi @sgugger, @lewtun,

Thank you so much for this course, it’s super helpful :slight_smile:

I am looking at Semantic Search using FAISS (https://huggingface.co/learn/nlp-course/chapter5/6?fw=pt) and am wondering what the score represents when using get_nearest_examples. Could the documentation include 1-2 sentences about the score? Is it Euclidean distance, inner product, or something else?

I checked the source code (https://huggingface.co/docs/datasets/v1.4.0/_modules/datasets/arrow_dataset.html#Dataset.add_faiss_index) but it was unfruitful :frowning:

Greatly appreciate it!

Is this really supposed to be ascending=True? Any updates?

I agree that this is an error. We get Euclidean/L2 distance (IndexFlatL2 is the default in FAISS). We should take the response with the lowest score; no sorting is required.
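If anyone wants to double-check which metric their own index uses, something along these lines should work (I'm assuming the index was added under the name "embeddings" as in the course, and that get_index and metric_type behave as in current datasets/faiss versions):

import faiss

# Inspect the underlying FAISS index that Datasets built for the "embeddings" column
faiss_index = embeddings_dataset.get_index("embeddings").faiss_index
print(faiss_index.metric_type == faiss.METRIC_L2)             # True -> lower score is better
print(faiss_index.metric_type == faiss.METRIC_INNER_PRODUCT)  # True -> higher score is better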


cc @lewtun maybe?

Hello, I am trying to follow this part of the course and to download the PubMed dataset, but it seems that https://the-eye.eu/ no longer hosts the dataset.
I also browsed https://the-eye.eu/public/AI, where I found no folder named pile_preliminary_components.
Thank you for your advice.
Wesam

Hello :hugs: team! Thank you so much for your work. I have learned a lot through the course.

I see there have been previous changes to the location of the repository. I think there has been a new one (July 2023), including a change in the dataset name, and I think it's now at https://the-eye.eu/public/AI/pile_neox/data/PubMedCentralDataset_text_document.bin.

Could you confirm this is the right file?

Thank you again!
