The 🤗 Datasets library - Hugging Face Course

In the chapter linked above, I don't understand the return_overflowing_tokens parameter that is passed to the tokenizer. Also, what does this error message mean, and why are we getting it?

ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

Hi !

I think return_overflowing_tokens can be used to keep the overflowing tokens when an example is longer than max_length. For example, with max_length=4 and an example consisting of 10 tokens:

  • if return_overflowing_tokens=False, then the example is cropped and you get one list of 4 tokens
  • if return_overflowing_tokens=True, then the example is split into lists of at most 4 tokens, so you end up with three lists of length 4, 4 and 2.
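To make the two cases concrete, here is a minimal sketch in plain Python. It does not call the real tokenizer: truncate and split_with_overflow are hypothetical helpers that only mimic the cropping and splitting described above, and they ignore special tokens and any overlap between chunks that a real tokenizer may add.

```python
def truncate(tokens, max_length):
    """Mimic return_overflowing_tokens=False: keep only the first max_length tokens."""
    return [tokens[:max_length]]


def split_with_overflow(tokens, max_length):
    """Mimic return_overflowing_tokens=True: cut the example into consecutive chunks."""
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]


tokens = list(range(10))  # stand-in for an example made of 10 token ids

cropped = truncate(tokens, max_length=4)
chunks = split_with_overflow(tokens, max_length=4)

print([len(c) for c in cropped])  # [4]
print([len(c) for c in chunks])   # [4, 4, 2]
```

The important point is that one input example can turn into several output rows in the second case.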

The error

ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000

means that there is a mismatch between the number of rows in the condition column and the number of rows in the columns created by the tokenize function. Indeed, from a batch of 1,000 examples, each having columns like condition, the tokenize function returned 1,463 tokenized texts. Because some columns have more rows than others, they can’t form a valid dataset table.

But since you don’t need the original columns (such as condition) at this point, you can just drop them and keep only the 1,463 tokenized texts with remove_columns=drug_dataset["train"].column_names
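Here is a small sketch of the mismatch and of why dropping the old columns resolves it. It is plain Python, not the 🤗 Datasets implementation: tokenize_and_split is a toy whitespace "tokenizer", the two-row batch and the review column are made up for illustration, and is_valid_table only checks that every column has the same number of rows.

```python
def tokenize_and_split(batch, max_length=4):
    """Toy stand-in for a tokenizer called with return_overflowing_tokens=True."""
    input_ids = []
    for text in batch["review"]:
        tokens = text.split()
        # Long texts produce several chunks, so several output rows.
        for start in range(0, len(tokens), max_length):
            input_ids.append(tokens[start:start + max_length])
    return {"input_ids": input_ids}


def column_lengths(columns):
    """Number of rows in each column."""
    return {name: len(values) for name, values in columns.items()}


def is_valid_table(lengths):
    """A table needs every column to have the same number of rows."""
    return len(set(lengths.values())) == 1


# Hypothetical batch of 2 examples (the real batch had 1,000).
batch = {
    "condition": ["acne", "pain"],
    "review": ["one two three four five six", "short review"],
}
new_columns = tokenize_and_split(batch)

# Keeping the old columns next to the new one: 2 rows vs 3 rows.
merged_lengths = column_lengths({**batch, **new_columns})
print(merged_lengths)                  # {'condition': 2, 'review': 2, 'input_ids': 3}
print(is_valid_table(merged_lengths))  # False

# Dropping the old columns: only the new column remains, so the lengths agree.
only_new_lengths = column_lengths(new_columns)
print(is_valid_table(only_new_lengths))  # True
```

Passing remove_columns to map plays the role of the last step here: only the columns returned by the tokenize function are kept, so their lengths no longer have to match the original batch.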

I hope that helps 🙂