Well, not much different. We need to be careful with IterableDataset, though…
Same issue here. Do you have a solution?
What worked for me was importing AutoTokenizer from transformers and defining the tokenizer inside tokenize_function, but then all the time goes into initializing it, and in the end it takes about as long as (or longer than) the original solution without num_proc.
I think it’s related to parallelization and the fact that there’s no tokenizer defined in each worker process. But there must be another way…
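One pattern that should avoid re-creating the tokenizer for every batch is to define it once at the top level, so it is set up once per worker process rather than once per batch. A minimal sketch, assuming the drug review file and the "review" column from the course chapter (adjust the names to your own setup):

from datasets import load_dataset
from transformers import AutoTokenizer

# Define the tokenizer once at module level: it is then set up once per
# worker process instead of being rebuilt inside every call to map().
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

# File name and delimiter follow the course chapter; swap in your own data.
drug_dataset = load_dataset("csv", data_files="drugsComTrain_raw.tsv", delimiter="\t")

tokenized = drug_dataset.map(tokenize_function, batched=True, num_proc=4)

Note that a fast tokenizer already parallelizes in Rust, so with batched=True it can be just as fast without num_proc; multiprocessing mainly pays off with slow tokenizers or heavy custom Python preprocessing.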
I have finally reached the tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True) error, and it makes no sense to me.
In fact, parts of the library make no sense to me. Sometimes the result is a list of examples; other times it is a dictionary of columns where each column holds a list of values. Very inconsistent and unintuitive. I only mention this because the ‘solution’ is so unintuitive it is staggering.
How does removing the columns solve the problem? Don’t the columns include “review”, which is the very text to be tokenized? I would never have concluded that I needed to remove the data I want to process with the tokenizer.
Also, there are a lot more than 1000 examples - like 134K examples. I assume this has something to do with batching, but this is never made explicit.
The description of the problem and its solution in the course, for me at least, are not sufficient to aid understanding. The points above need to be addressed, but more important is the question, “Why is this a problem at all?” Why doesn’t the map function just deal with it without throwing an error?
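For anyone else puzzling over this: map() with batched=True passes batches of 1,000 rows by default (which is where the 1,000 comes from), and because tokenize_and_split uses return_overflowing_tokens=True it returns more rows than it receives, since long reviews are split into several chunks. The old columns still have the original number of values, so Arrow cannot line them up with the new, longer columns; dropping them avoids the mismatch, and the “review” text has already been tokenized by that point. A rough sketch of the fix as I understand it from the chapter, assuming the tokenizer and drug_dataset defined earlier:

def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

# remove_columns drops the old, shorter columns so the output rows
# (one per chunk, not one per review) can form a consistent table.
tokenized_dataset = drug_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=drug_dataset["train"].column_names,
)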
Hello, while going through the “Creating your own dataset” section I encountered a problem, which I described here.
Long story short,
load_dataset("json", data_files="datasets-issues.jsonl", split="train")
throws
TypeError: Couldn't cast array of type timestamp[s] to null
When running in a Jupyter notebook, the error was more cryptic:
DatasetGenerationError: An error occurred while generating the dataset
Anyway, as a workaround you can use this code:
import json

# Read the JSON Lines file (one issue object per line) into a list of dicts
with open("datasets-issues.jsonl", "r") as f_in:
    lines = [json.loads(line) for line in f_in]

# Write it back out as a single JSON array, which the json loader can parse
with open("datasets-issues.json", "w") as f_out:
    json.dump(lines, f_out, indent=2)
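The converted file can then be loaded the same way, just pointing at the .json file; a quick sketch, assuming the same file name as above:

from datasets import load_dataset

issues_dataset = load_dataset("json", data_files="datasets-issues.json", split="train")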