`return_overflowing_tokens` can be used to keep the overflowing tokens in case an example is longer than `max_length`. For example, with `max_length=4` and an example consisting of 10 tokens:
- `return_overflowing_tokens=False`: the example is cropped and you get one list of 4 tokens.
- `return_overflowing_tokens=True`: the example is split into lists of at most 4 tokens, so you end up with three lists of lengths 4, 4, and 2 (see the sketch below).
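Here is a minimal sketch of the difference (assuming a fast tokenizer such as `bert-base-uncased`; `add_special_tokens=False` is used only so the chunk lengths match the numbers above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "one two three four five six seven eight nine ten"  # 10 tokens

# return_overflowing_tokens=False (the default): tokens past max_length are dropped
enc = tokenizer(text, max_length=4, truncation=True, add_special_tokens=False)
print(len(enc["input_ids"]))  # 4

# return_overflowing_tokens=True: the example is split into chunks of at most 4 tokens
enc = tokenizer(
    text,
    max_length=4,
    truncation=True,
    add_special_tokens=False,
    return_overflowing_tokens=True,
)
print([len(ids) for ids in enc["input_ids"]])  # [4, 4, 2]
```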
The error

`ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000`

means that there is a mismatch between the number of rows in the `condition` column and the number of rows in the columns created by the tokenize function. Indeed, from 1,000 examples, each with columns like `condition`, the tokenize function returned 1,463 tokenized texts. Because some columns end up with more rows than others, Arrow can't form a valid dataset table.
But since you don't care about the original columns at this point, you can just drop them and only keep the 1,463 tokenized texts by passing `remove_columns` to `map`:
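For example (a sketch with a toy dataset; I'm assuming the text column is called `review` here, so adapt the names to your dataset):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# toy dataset: a metadata column (`condition`) next to the text column
dataset = Dataset.from_dict({
    "review": ["a short review", "a much longer review " * 30],
    "condition": ["acne", "anxiety"],
})

def tokenize_function(batch):
    # can return more rows than it receives when a review overflows max_length
    return tokenizer(
        batch["review"],
        max_length=16,
        truncation=True,
        return_overflowing_tokens=True,
    )

# remove_columns drops `review` and `condition`, so the new, longer columns
# no longer clash with the old ones and the table stays valid
tokenized = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
)
print(len(dataset), len(tokenized))  # the tokenized set has more rows than the original
```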
I hope that helps!