Chapter 5 questions

I have the same problem.
At the same time,
df = pd.read_json(data_files, lines=True)
works fine to load the same file.

I have the same problem also.

@Mentatko
This statement creates a pandas DataFrame, but load_dataset creates an Apache Arrow dataset. They have different attributes and methods.
For example, you cannot execute the next statement from the same chapter on it.

sample = issues_dataset.shuffle(seed=666).select(range(3))

I solved the problem with the help of ChatGPT :grinning:
The problem occurs because there are some NULL timestamps and

load_dataset()

cannot handle them. We need to handle the issue manually.
You can replace the part below:

all_issues.extend(batch)

# Replace missing timestamp values with default value
for issue in all_issues:
    if "created_at" in issue and issue["created_at"] is None:
        issue["created_at"] = "1970-01-01T00:00:00Z"
    if "updated_at" in issue and issue["updated_at"] is None:
        issue["updated_at"] = "1970-01-01T00:00:00Z"

df = pd.DataFrame.from_records(all_issues)

Yes, true.
At the same time, after

df = pd.read_json(data_files, lines=True)

I used

from datasets import Dataset
tds = Dataset.from_pandas(df)

so I have the same data format.

By the way, I made my JSON work by removing the fields
'author_association', 'timeline_url', 'reactions', 'performed_via_github_app'.
I don't know why, but the datetime fields 'created_at', 'updated_at', 'closed_at' did not trigger an error.
I did:

import pandas as pd
from datasets import Dataset, load_dataset

df = pd.read_json(data_files, lines=True)
tds = Dataset.from_pandas(df)
dataset = tds.remove_columns(
    ["author_association", "timeline_url", "reactions", "performed_via_github_app"]
)
data_file_name = "datasets-issues-wo-cols.jsonl"
new_data_file = PROJECT_DIR + data_file_name  # PROJECT_DIR is defined earlier in my notebook
dataset.to_json(new_data_file, orient="records", lines=True)
issues_dataset = load_dataset("json", data_files=new_data_file)
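For anyone whose error does come from the datetime columns, a rough, untested sketch of an alternative to dropping them would be to cast them to strings before writing the JSONL:

import pandas as pd
from datasets import Dataset

# data_files is the same JSONL path used above
df = pd.read_json(data_files, lines=True)
for col in ["created_at", "updated_at", "closed_at"]:
    if col in df.columns:
        # datetime64 values become ISO-like strings; NaT becomes the string "NaT"
        df[col] = df[col].astype(str)

tds = Dataset.from_pandas(df)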


@Mentatko
It is very strange that my solution does not work now. :sweat_smile:
My issue count was 2,000 before. Changing the timestamps worked then, but it did not work with 10,000 issues.
I had to remove the columns 'created_at', 'updated_at', 'closed_at' along with the fields you mentioned.

Thanks for the solution.


Update:
Add

, split="train"

to the parameters of the last line. That is, instead of:

issues_dataset = load_dataset("json", data_files=new_data_file)

use

issues_dataset = load_dataset("json", data_files=new_data_file, split="train")

so that the next line does not raise an error:

sample = issues_dataset.shuffle(seed=666).select(range(3))
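Alternatively, you can keep the default DatasetDict and index its "train" split afterwards; shuffle() and select() then work either way, because they are called on a Dataset rather than on a DatasetDict:

from datasets import load_dataset

# Equivalent alternative: index the "train" split of the returned DatasetDict
issues_dataset = load_dataset("json", data_files=new_data_file)["train"]
sample = issues_dataset.shuffle(seed=666).select(range(3))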

Hi, first I would like to thank you for this amazing course. It's been really helpful. I have a question I couldn't find the answer to anywhere. In Chapter 3, in "A full training", you use the PyTorch DataLoader. From what I understand, this doesn't use the memory-mapping of the data on disk. My question is: what would the final training look like without using the DataLoader, keeping the data in the Dataset object so as to take advantage of the memory-mapping? How can I feed the data to the model in batches without the DataLoader?
I'm new to the Transformers library.
Thanks in advance.
Nico
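P.S. This rough sketch is what I have in mind (variable names borrowed from Chapter 3; I'm assuming the extra text columns were already removed, the label column renamed, and padding="max_length" used at tokenization so the slices stack into tensors):

# Slicing the Arrow-backed Dataset directly only materialises one batch at a time
tokenized_datasets.set_format("torch")
train_dataset = tokenized_datasets["train"]

batch_size = 8
for start in range(0, len(train_dataset), batch_size):
    batch = train_dataset[start : start + batch_size]  # dict of stacked tensors
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()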

Thanks for the wonderful course - quick question about the batch mapping code:

new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

I seem to be able to run the code like this as well

drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])}, batched=True)

I understand the principle that a batch of 1,000 items is being operated on by default, but is the list comprehension actually necessary?
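To make my question concrete, here is a toy sketch (toy data, not the drug dataset) of what I understand each version receives:

import html
from datasets import Dataset

toy = Dataset.from_dict({"review": ["A &amp; B", "C &lt; D"]})

# batched=True: x["review"] is a list of strings, so each element is unescaped
batched = toy.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

# default batched=False: x["review"] is a single string
unbatched = toy.map(lambda x: {"review": html.unescape(x["review"])})

print(batched["review"], unbatched["review"])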

[Possible error in the course]

I think I found an error in your course. I'm not sure how to report it.
What is wrong, in my opinion:
In Semantic search with FAISS - Hugging Face Course,
when you sort the best matches:

samples_df.sort_values("scores", ascending=False, inplace=True)

I believe it should be ascending=True.

When you change the k in:

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

to k=1, you will get the lowest score as the best match, not the highest.
Please let me know if my thinking is correct.

Hello

I am trying to complete the "Creating your own dataset" tutorial, but I cannot load the dataset after I pull issues from GitHub.

Here is the error I get:

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-2cf51354fbcf9df8/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%
1/1 [00:00<00:00, 57.27it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.86it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1872                         )
-> 1873                     writer.write_table(table)
   1874                     num_examples_progress_update += len(table)

17 frames
TypeError: Couldn't cast array of type timestamp[s] to null

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1889             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1890                 e = e.__context__
-> 1891             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1892 
   1893         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

this is my copy of the notebook:

Thanks in advance!


I have the same error on Colab. Have you fixed it? The error looks like this:

ValueError Traceback (most recent call last)
in <cell line: 4>()
2 from datasets import interleave_datasets
3
----> 4 combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
5 list(islice(combined_dataset, 2))

2 frames
/usr/local/lib/python3.10/dist-packages/datasets/features/features.py in _check_if_features_can_be_aligned(features_list)
   2092     for k, v in features.items():
   2093         if not (isinstance(v, Value) and v.dtype == "null") and name2feature[k] != v:
-> 2094             raise ValueError(
   2095                 f'The features can\'t be aligned because the key {k} of features {features} has unexpected type - {v} (expected either {name2feature[k]} or Value("null").'
   2096             )

ValueError: The features can't be aligned because the key meta of features {'meta': {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)}, 'text': Value(dtype='string', id=None)} has unexpected type - {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)} (expected either {'language': Value(dtype='string', id=None), 'pmid': Value(dtype='int64', id=None)} or Value("null").
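I wonder whether dropping the mismatched meta column before interleaving would be an acceptable workaround; something like this untested sketch:

from itertools import islice
from datasets import interleave_datasets

# Untested sketch: the error says the "meta" features of the two streamed
# datasets differ, so drop that column from both before interleaving
combined_dataset = interleave_datasets(
    [
        pubmed_dataset_streamed.remove_columns("meta"),
        law_dataset_streamed.remove_columns("meta"),
    ]
)
list(islice(combined_dataset, 2))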


Hi michalreal, I’m so glad that someone else has picked up on this - thought I was going mad!

The issue that I’ve seen (and that I think you are referring to also) is easily visible by updating the concatenate_text function to only search the title e.g.

def concatenate_text(examples):
    return {
        "text": examples["title"]
    }

And also changing the question to a single-term search, e.g. "failing".

It’s then very obvious that the best hit always comes back with the lowest score.

Unfortunately, I can't confirm whether it's an issue or just a "feature" of how the scoring works (i.e. lower score is better), but it seems pretty odd.

Did you (or anyone) get to the bottom of this?

Andy

I have the same issue - Google Colab notebook.

Hi @sgugger, @lewtun,

Thank you so much for this course, it’s super helpful :slight_smile:

I am looking at Semantic Search using FAISS (https://huggingface.co/learn/nlp-course/chapter5/6?fw=pt) and am wondering what the score represents when using get_nearest_examples. Could the documentation include 1-2 sentences about the score? Is it Euclidean distance, inner product, or something else?

I checked the source code (https://huggingface.co/docs/datasets/v1.4.0/_modules/datasets/arrow_dataset.html#Dataset.add_faiss_index) but it was unfruitful :frowning:

Greatly appreciate it!

Is this really supposed to be ascending=True? Any updates?

I agree that this is an error. We get Euclidean/L2 distance (IndexFlatL2 is the default in FAISS). We should take the response with the lowest score; no sorting is required.
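If anyone wants to double-check which metric their own index uses, something along these lines should work (I'm assuming the index was added under the name "embeddings" as in the course, and that get_index and metric_type behave as in current datasets/faiss versions):

import faiss

# Inspect the underlying FAISS index that Datasets built for the "embeddings" column
faiss_index = embeddings_dataset.get_index("embeddings").faiss_index
print(faiss_index.metric_type == faiss.METRIC_L2)             # True -> lower score is better
print(faiss_index.metric_type == faiss.METRIC_INNER_PRODUCT)  # True -> higher score is better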


cc @lewtun maybe?

Hello, I am trying to follow this part of the course and to download the PubMed dataset, but it seems that https://the-eye.eu/ no longer hosts the dataset.
I also browsed https://the-eye.eu/public/AI, where I found no folder named pile_preliminary_components.
Thank you for your advice.
Wesam

Hello :hugs: team! Thank you so much for your work. I have learned a lot through the course.

I see there have been previous changes to the location of the repository. I think there has been a new one (July 2023), including a change in the dataset name, and I think it's now at https://the-eye.eu/public/AI/pile_neox/data/PubMedCentralDataset_text_document.bin.

Could you confirm this is the right file?

Thank you again!
