The 🤗 Datasets library - Hugging Face Course

Modfiededition · November 22, 2021, 9:36am

In the section linked above, I don't understand the return_overflowing_tokens parameter passed to the tokenizer. I also don't understand this error message or why we are getting it:

ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

lhoestq · November 25, 2021, 4:44pm

Hi!

I think return_overflowing_tokens can be used to keep the overflowing tokens in case an example is longer than max_length. For example for max_length=4 and for an example consisting of 10 tokens:

if return_overflowing_tokens=False, then the example is truncated and you get one list of 4 tokens
if return_overflowing_tokens=True, then the example is split into lists of at most 4 tokens, so you end up with three lists of length 4, 4 and 2.
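
To make the two cases concrete, here is a rough pure-Python sketch of the splitting. It is not how the tokenizer is implemented: split_into_chunks is a made-up helper, the integers stand in for token ids, and special tokens and the stride option are ignored.

```python
def split_into_chunks(tokens, max_length):
    """Split a token list into consecutive chunks of at most max_length tokens."""
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]


tokens = list(range(10))  # stand-in for an example made of 10 token ids
max_length = 4

# Roughly what happens with return_overflowing_tokens=False: the rest is dropped.
truncated = tokens[:max_length]

# Roughly what happens with return_overflowing_tokens=True: the rest is kept
# as additional chunks.
chunks = split_into_chunks(tokens, max_length)
chunk_lengths = [len(chunk) for chunk in chunks]

print(len(truncated), chunk_lengths)
```

So one input example can turn into several output rows, which is what leads to the error discussed next.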

The error

ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

means that there is a mismatch between the number of rows in the condition column and the number of rows in the columns created by the tokenize function. Indeed, the batch contained 1,000 examples, each with columns like condition, but the tokenize function returned 1,463 tokenized texts. Because some columns have more rows than others, they can't form a valid dataset table.

But since you don't need the original columns (such as condition) at this point, you can drop them and keep only the 1,463 tokenized texts by passing remove_columns=drug_dataset["train"].column_names to map.
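
Here is a small pure-Python sketch of the mismatch and of why dropping the old columns avoids it. It does not use the datasets library: fake_tokenize and column_lengths are hypothetical helpers, the two-row batch is invented, and the review column name is an assumption about the dataset.

```python
def fake_tokenize(batch, max_length=4):
    """Hypothetical stand-in for a tokenize function that splits long texts."""
    input_ids = []
    for text in batch["review"]:
        tokens = text.split()
        # One output row per chunk, so long texts produce several rows.
        for i in range(0, len(tokens), max_length):
            input_ids.append(tokens[i:i + max_length])
    return {"input_ids": input_ids}


def column_lengths(table):
    """Return the number of rows in each column of a dict-of-lists table."""
    return {name: len(values) for name, values in table.items()}


# Invented batch of 2 examples; "review" is an assumed column name.
batch = {
    "condition": ["acne", "pain"],
    "review": ["a b c d e f", "g h"],
}
new_columns = fake_tokenize(batch)

# Keeping the old columns: "condition" has 2 rows but "input_ids" has 3,
# the same kind of mismatch the ArrowInvalid error reports.
kept_lengths = column_lengths({**batch, **new_columns})

# Dropping the old columns: every remaining column has the same length.
dropped_lengths = column_lengths(new_columns)

print(kept_lengths)
print(dropped_lengths)
```

With the old columns removed, only the new columns remain and they all have the same number of rows, so the table is valid again.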

I hope that helps
