Well, not much different. We need to be careful with IterableDataset, though…
Same issue here. Do you have a solution?
What worked for me was importing AutoTokenizer from transformers and defining the tokenizer inside tokenize_function, but then all the time goes into initializing it, and in the end it takes about as long as (or longer than) the original solution without num_proc.
I think it’s related to parallelization and the fact that there’s no tokenizer defined in each worker process. But there must be another way…
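One pattern that should avoid re-creating the tokenizer for every batch is to define it once at the top level, so it is set up once per worker process rather than once per batch. A minimal sketch, assuming the drug review file and the "review" column from the course chapter (adjust the names to your own setup):

from datasets import load_dataset
from transformers import AutoTokenizer

# Define the tokenizer once at module level: it is then set up once per
# worker process instead of being rebuilt inside every call to map().
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

# File name and delimiter follow the course chapter; swap in your own data.
drug_dataset = load_dataset("csv", data_files="drugsComTrain_raw.tsv", delimiter="\t")

tokenized = drug_dataset.map(tokenize_function, batched=True, num_proc=4)

Note that a fast tokenizer already parallelizes in Rust, so with batched=True it can be just as fast without num_proc; multiprocessing mainly pays off with slow tokenizers or heavy custom Python preprocessing.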
I have finally reached the tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True) error, and it makes no sense to me.
In fact, parts of the library make no sense to me. Sometimes the result is a list of examples; other times it is a dictionary of columns where each column holds a list of values. Very inconsistent and unintuitive. I only mention this because the ‘solution’ is so unintuitive it is staggering.
How does removing the columns solve the problem? Don’t the columns include “review”, which is the very text to be tokenized? I would never have concluded that I needed to remove the data I want to process with the tokenizer.
Also, there are a lot more than 1000 examples - like 134K examples. I assume this has something to do with batching, but this is never made explicit.
The description of the problem and its solution in the course, for me at least, are not sufficient to aid understanding. The points above need to be addressed, but more important is the question, “Why is this a problem at all?” Why doesn’t the map function just deal with it without throwing an error?
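For anyone else puzzling over this: map() with batched=True passes batches of 1,000 rows by default (which is where the 1,000 comes from), and because tokenize_and_split uses return_overflowing_tokens=True it returns more rows than it receives, since long reviews are split into several chunks. The old columns still have the original number of values, so Arrow cannot line them up with the new, longer columns; dropping them avoids the mismatch, and the “review” text has already been tokenized by that point. A rough sketch of the fix as I understand it from the chapter, assuming the tokenizer and drug_dataset defined earlier:

def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

# remove_columns drops the old, shorter columns so the output rows
# (one per chunk, not one per review) can form a consistent table.
tokenized_dataset = drug_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=drug_dataset["train"].column_names,
)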
Hello, while going through the “Creating your own dataset” section I encountered a problem, which I described here.
Long story short,
load_dataset("json", data_files="datasets-issues.jsonl", split="train")
throws
TypeError: Couldn't cast array of type timestamp[s] to null
When running in a Jupyter notebook, the error was more cryptic:
DatasetGenerationError: An error occurred while generating the dataset
Anyway, as a workaround you can use this code:
import json

# Read the JSON Lines file (one issue object per line) into a list of dicts
with open("datasets-issues.jsonl", "r") as f_in:
    lines = [json.loads(line) for line in f_in]

# Write it back out as a single JSON array, which the json loader can parse
with open("datasets-issues.json", "w") as f_out:
    json.dump(lines, f_out, indent=2)
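The converted file can then be loaded the same way, just pointing at the .json file; a quick sketch, assuming the same file name as above:

from datasets import load_dataset

issues_dataset = load_dataset("json", data_files="datasets-issues.json", split="train")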