DataCollatorWithPadding without Tokenizer

I want to fine-tune a model…
```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained('monilouise/ner_pt_br')
```
with this dataset:
```python
from datasets import load_dataset

raw_datasets = load_dataset('lener_br')
```

The loaded raw_datasets is already tokenized and encoded, and I don't know how it was tokenized. Now I want to pad the inputs, but I don't know how to use DataCollatorWithPadding in this case.
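
For reference, a quick way to see exactly what the loaded dataset contains (the 'train' split name is an assumption based on the usual layout):

```python
# Print one example to see which columns are present and whether the
# text is stored as word tokens, input_ids, or both.
print(raw_datasets['train'][0])
print(raw_datasets['train'].features)
```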

I noticed that this dataset is similar to the wnut dataset from the docs. Still, I can't figure out what I should do.

I would say you can use the base BERT tokenizer (since it's a BERT model). Just make sure the pad token is compatible with what the model expects.
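
A minimal sketch, assuming the checkpoint ships its own tokenizer files (if it does not, load the base BERT checkpoint it was fine-tuned from instead):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

# Assumption: 'monilouise/ner_pt_br' includes tokenizer files on the Hub;
# otherwise substitute the base BERT checkpoint it was fine-tuned from.
tokenizer = AutoTokenizer.from_pretrained('monilouise/ner_pt_br')

# Pads input_ids/attention_mask to the longest sequence in each batch,
# using tokenizer.pad_token_id as the padding value.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

Note that DataCollatorWithPadding does not pad the labels; for token classification (as in the wnut example from the docs), DataCollatorForTokenClassification pads the label sequences as well.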


Is there a way to check this from the downloaded model, or is this something I will find in the model card?

Check the pad_token_id field in the model config.
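
Something like this, assuming the config and tokenizer both load from the Hub:

```python
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained('monilouise/ner_pt_br')
tokenizer = AutoTokenizer.from_pretrained('monilouise/ner_pt_br')

# These should match: the collator pads with tokenizer.pad_token_id,
# and the model's embedding layer uses config.pad_token_id as its
# padding index.
print(config.pad_token_id, tokenizer.pad_token_id)
```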
