DataCollatorForTokenClassification is used like this:
from transformers import DataCollatorForTokenClassification, Trainer

data_collator = DataCollatorForTokenClassification(tokenizer)

...

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
I am trying to figure out what the real purpose of this is. It appears that the purpose of DataCollatorForTokenClassification is padding, truncation, etc., but you can also do that in the tokenizer itself. Why do we need this extra component, then? Is it because the data collator does the padding per batch, on the fly (only to the longest example in each batch), instead of padding the whole dataset up front, and is therefore more efficient?
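To make the comparison concrete, here is a rough sketch of the two approaches as I understand them (the checkpoint, sentences, and label ids are just placeholder values):

from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint

# Option 1: pad in the tokenizer, up front, to one fixed length for every example
fixed = tokenizer(
    ["John lives in New York", "Hi"],
    padding="max_length",
    truncation=True,
    max_length=32,
)
print(len(fixed["input_ids"][0]), len(fixed["input_ids"][1]))  # 32 32

# Option 2: let the collator pad each batch on the fly, only to the longest
# example in that batch; labels are padded with -100 so the loss ignores them
features = [
    {"input_ids": tokenizer("John lives in New York")["input_ids"],
     "labels": [0, 1, 0, 0, 3, 4, 0]},  # placeholder label ids, one per token incl. special tokens
    {"input_ids": tokenizer("Hi")["input_ids"],
     "labels": [0, 0, 0]},
]
data_collator = DataCollatorForTokenClassification(tokenizer)
batch = data_collator(features)
print(batch["input_ids"].shape)  # e.g. torch.Size([2, 7]) with this checkpoint

With the tokenizer approach every example is padded to the fixed max_length, while the collator pads only to each batch's maximum, which is what makes me think the difference is efficiency.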