DataCollatorForTokenClassification is used like this:
from transformers import DataCollatorForTokenClassification, Trainer

data_collator = DataCollatorForTokenClassification(tokenizer)

...

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
I am trying to figure out what the real purpose of this is. It appears that the purpose of DataCollatorForTokenClassification is padding, truncation, etc., but you can also do that in the tokenizer itself. Why do we need this extra component, then? Is it because the data collator does the padding per batch, on the fly (only to the longest example in each batch), instead of padding the whole dataset up front, and is therefore more efficient?
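To make the comparison concrete, here is a rough sketch of the two approaches as I understand them (the checkpoint, sentences, and label ids are just placeholder values):

from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint

# Option 1: pad in the tokenizer, up front, to one fixed length for every example
fixed = tokenizer(
    ["John lives in New York", "Hi"],
    padding="max_length",
    truncation=True,
    max_length=32,
)
print(len(fixed["input_ids"][0]), len(fixed["input_ids"][1]))  # 32 32

# Option 2: let the collator pad each batch on the fly, only to the longest
# example in that batch; labels are padded with -100 so the loss ignores them
features = [
    {"input_ids": tokenizer("John lives in New York")["input_ids"],
     "labels": [0, 1, 0, 0, 3, 4, 0]},  # placeholder label ids, one per token incl. special tokens
    {"input_ids": tokenizer("Hi")["input_ids"],
     "labels": [0, 0, 0]},
]
data_collator = DataCollatorForTokenClassification(tokenizer)
batch = data_collator(features)
print(batch["input_ids"].shape)  # e.g. torch.Size([2, 7]) with this checkpoint

With the tokenizer approach every example is padded to the fixed max_length, while the collator pads only to each batch's maximum, which is what makes me think the difference is efficiency.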