I have the same problem.
At the same time,
df = pd.read_json(data_files, lines=True)
works fine to load the same file.
I have the same problem as well.
@Mentatko:
This statement creates a pandas DataFrame, but load_dataset creates an Apache Arrow dataset. They have different attributes and methods.
For example, you cannot execute the next statement from the same chapter:
sample = issues_dataset.shuffle(seed=666).select(range(3))
I have solved the problem with the help of ChatGPT.
The problem occurs because there are some NULL timestamps and
load_dataset()
cannot handle them. We need to handle the issue manually.
You can replace the part below:
all_issues.extend(batch)

# Replace missing timestamp values with a default value
for issue in all_issues:
    if "created_at" in issue and issue["created_at"] is None:
        issue["created_at"] = "1970-01-01T00:00:00Z"
    if "updated_at" in issue and issue["updated_at"] is None:
        issue["updated_at"] = "1970-01-01T00:00:00Z"

df = pd.DataFrame.from_records(all_issues)
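Then, to get the cleaned records back into load_dataset(), I write them out as JSON Lines again; a minimal sketch (the file name here is just an example, not from the course):

df.to_json("datasets-issues-clean.jsonl", orient="records", lines=True)
issues_dataset = load_dataset("json", data_files="datasets-issues-clean.jsonl")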
Yes, true.
At the same time, after
df = pd.read_json(data_files, lines=True)
I used
from datasets import Dataset
tds = Dataset.from_pandas(df)
so I have the same data format.
By the way, I made my JSON work by removing the fields
"author_association", "timeline_url", "reactions", "performed_via_github_app".
I don't know why, but the datetime fields "created_at", "updated_at", "closed_at" did not trigger the error.
I did
df = pd.read_json(data_files, lines=True)
from datasets import Dataset
tds = Dataset.from_pandas(df)
dataset = tds.remove_columns(["author_association", "timeline_url", "reactions", "performed_via_github_app"])
data_file_name = "datasets-issues-wo-cols.jsonl"
new_data_file = PROJECT_DIR + data_file_name
dataset.to_json(f"{new_data_file}", orient="records", lines=True)
issues_dataset = load_dataset("json", data_files=new_data_file)
@Mentatko:
It is very strange that my solution does not work now.
My issue count was 2,000 before. Changing the timestamps worked then, but it did not work with 10,000 issues.
I had to remove the columns "created_at", "updated_at", "closed_at" along with the fields you mentioned.
Thanks for the solution.
Update:
Add
, split="train"
to the last line's parameters; that is, instead of:
issues_dataset = load_dataset("json", data_files=new_data_file)
use
issues_dataset = load_dataset("json", data_files=new_data_file, split="train")
so that the next line does not raise an error:
sample = issues_dataset.shuffle(seed=666).select(range(3))
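For context, my understanding (not stated in the course at this point) is that without the split argument, load_dataset returns a DatasetDict keyed by split name rather than a single Dataset, so indexing into the train split works as well:

issues_dataset = load_dataset("json", data_files=new_data_file)
sample = issues_dataset["train"].shuffle(seed=666).select(range(3))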
Hi, first I would like to thank you for this amazing course. It's been really helpful. I have a question I couldn't find the answer to anywhere. In chapter 3, in "A full training", you use the PyTorch DataLoader. From what I understand, this doesn't use the memory-mapping of the data on disk. My question is: what would the final training look like without using the DataLoader, keeping the data in the Dataset object so as to make use of the memory-mapping capability? How can I load the data as batches into the model without the DataLoader? (A rough sketch of what I mean is below.)
I'm new to the Transformers library.
Thanks in advance.
Nico
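P.S. A rough sketch of what I mean, assuming the tokenized_datasets (with the string columns already removed), data_collator, and model objects from the chapter:

batch_size = 8
train_data = tokenized_datasets["train"]

for start in range(0, len(train_data), batch_size):
    # select() keeps the rows memory-mapped on disk until they are actually read
    rows = train_data.select(range(start, min(start + batch_size, len(train_data))))
    # the collator pads the examples and converts them into tensors
    batch = data_collator([rows[i] for i in range(len(rows))])
    outputs = model(**batch)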
Thanks for the wonderful course - quick question about the batch mapping code:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)
I seem to be able to run the code like this as well
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])}, batched=True)
I understand the principle that a batch of 1000 data items are being operated on by default, but is the list comprehension actually necessary?
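To illustrate the question, here is a small standalone check (my own sketch, not course code). It suggests the second form runs but silently does nothing: with batched=True, x["review"] is a list of strings, and as far as I can tell from the CPython source, html.unescape first checks whether '&' is in its argument and returns it unchanged otherwise; for a list that is a membership test, so the list usually comes back untouched.

import html

batch = {"review": ["I&#039;m happy", "Great &amp; cheap"]}

# With the list comprehension, each string is unescaped as expected
print([html.unescape(o) for o in batch["review"]])  # ["I'm happy", "Great & cheap"]

# Passing the whole list in: '&' is not an element of the list, so
# unescape returns it unchanged and no unescaping happens at all
print(html.unescape(batch["review"]) == batch["review"])  # True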
[Possible error in the course]
I think I found an error in your course. I'm not sure how to report it.
What is wrong, in my opinion?
In Semantic search with FAISS - Hugging Face Course,
when you sort the best matches:
samples_df.sort_values("scores", ascending=False, inplace=True)
I believe it should be ascending=True.
When you change the k in:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)
to k=1, you will get the lowest score as the best match, not the highest.
Please let me know if my thinking is correct.
Hello,
I am trying to complete the "Creating your own dataset" tutorial, but I cannot load the dataset after I pull the issues from GitHub.
Here is the error:
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-2cf51354fbcf9df8/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100% 1/1 [00:00<00:00, 57.27it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 23.86it/s]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1872 )
-> 1873 writer.write_table(table)
1874 num_examples_progress_update += len(table)
[... 17 frames omitted ...]
TypeError: Couldn't cast array of type timestamp[s] to null
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1889 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
1890 e = e.__context__
-> 1891 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1892
1893 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
this is my copy of the notebook:
Thanks in advance!
I have the same error on Colab. Have you fixed it? The error looks like this:
ValueError Traceback (most recent call last)
in <cell line: 4>()
2 from datasets import interleave_datasets
3
----> 4 combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
5 list(islice(combined_dataset, 2))
[... 2 frames omitted ...]
/usr/local/lib/python3.10/dist-packages/datasets/features/features.py in _check_if_features_can_be_aligned(features_list)
2092 for k, v in features.items():
2093 if not (isinstance(v, Value) and v.dtype == "null") and name2feature[k] != v:
--> 2094 raise ValueError(
2095 f'The features can\'t be aligned because the key {k} of features {features} has unexpected type - {v} (expected either {name2feature[k]} or Value("null").'
2096 )
ValueError: The features can't be aligned because the key meta of features {'meta': {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)}, 'text': Value(dtype='string', id=None)} has unexpected type - {'case_ID': Value(dtype='string', id=None), 'case_jurisdiction': Value(dtype='string', id=None), 'date_created': Value(dtype='string', id=None)} (expected either {'language': Value(dtype='string', id=None), 'pmid': Value(dtype='int64', id=None)} or Value("null").
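The only workaround I have found so far is to drop the mismatched nested meta column from both streams before interleaving, so that only the shared text feature has to be aligned. A rough sketch, assuming the two streamed datasets from the chapter:

from itertools import islice
from datasets import interleave_datasets

# Drop the incompatible "meta" columns so the remaining features line up
pubmed_text = pubmed_dataset_streamed.remove_columns("meta")
law_text = law_dataset_streamed.remove_columns("meta")

combined_dataset = interleave_datasets([pubmed_text, law_text])
print(list(islice(combined_dataset, 2)))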
Hi michalreal, I'm so glad that someone else has picked up on this - thought I was going mad!
The issue that I've seen (and that I think you are referring to also) is easily visible by updating the concatenate_text function to only search the title, e.g.
def concatenate_text(examples):
    return {
        "text": examples["title"]
    }
and also changing the question to a single-term search, e.g. "failing".
It's then very obvious that the best hit always comes back with the lowest score.
Unfortunately, I can't confirm whether it's an issue or just a "feature" of how the scoring works (i.e. lower score is better), but it seems pretty odd.
Did you (or anyone) get to the bottom of this?
Andy
Thank you so much for this course, it's super helpful.
I am looking at Semantic Search using FAISS (https://huggingface.co/learn/nlp-course/chapter5/6?fw=pt) and am wondering what the score represents when using get_nearest_examples. Could the documentation include 1-2 sentences about the score? Is it Euclidean distance, inner product, or something else?
I checked the source code (https://huggingface.co/docs/datasets/v1.4.0/_modules/datasets/arrow_dataset.html#Dataset.add_faiss_index) but came up empty.
Greatly appreciate it!
Is this really ascending=True? Any updates?
I agree that this is an error. We get the Euclidean/L2 distance (IndexFlatL2 is the default in FAISS). We should take the response with the lowest score; no sorting is required.
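A quick way to see the convention for yourself, as a minimal sketch under the assumption that the index really is a plain IndexFlatL2:

import faiss
import numpy as np

# Two reference vectors: one near the upcoming query, one far away
index = faiss.IndexFlatL2(2)
index.add(np.array([[0.0, 0.0], [10.0, 10.0]], dtype="float32"))

# Search for the 2 nearest neighbours of a point close to the first vector
scores, ids = index.search(np.array([[0.1, 0.0]], dtype="float32"), k=2)
print(ids, scores)  # the closest vector (id 0) has the smallest score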
cc @lewtun maybe?
Hello, I am trying to follow this part of the course and download the PubMed dataset, but it seems that https://the-eye.eu/ no longer hosts it.
I also browsed https://the-eye.eu/public/AI, where I found no folder named pile_preliminary_components.
Thank you for your advice.
Wesam
Hello team! Thank you so much for your work. I have learned a lot through the course.
I see there have been previous changes to the location of the repository. I think there has been a new one (July 2023), including a change in the dataset name, and I think it's now at https://the-eye.eu/public/AI/pile_neox/data/PubMedCentralDataset_text_document.bin.
Could you confirm this is the right file?
Thank you again!