I have a dataset used to train a token classification model.
The dataset contains three columns: tokens, labels, and frequencies (float values).
The float values will be concatenated to last_hidden_state just before the classifier is called.
Unfortunately, DataCollatorForTokenClassification does not accept the "frequencies" column.
I am receiving the following error:
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 748, in convert_to_tensors
    tensor = as_tensor(value)
             ^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 720, in as_tensor
    return torch.tensor(value)
           ^^^^^^^^^^^^^^^^^^^
ValueError: expected sequence of length 5 at dim 1 (got 14)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/chopin/PycharmProjects/term-extractor/train.py", line 136, in <module>
    trainer.train()
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/trainer.py", line 1870, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/accelerate/data_loader.py", line 384, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/data/data_collator.py", line 45, in __call__
    return self.torch_call(features)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/data/data_collator.py", line 310, in torch_call
    batch = self.tokenizer.pad(
            ^^^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3303, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 223, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/Users/chopin/anaconda3/envs/learn/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 764, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`frequencies` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
0%| | 0/27360 [00:00<?, ?it/s]
What should I do?
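
One possible direction, given as a minimal sketch and not a confirmed fix: the traceback suggests the frequencies lists have different lengths within a batch (5 vs. 14), and tokenizer.pad does not know how to pad that column. A wrapper collator could remove "frequencies" before the stock collator runs, pad the lists to the padded input length, and put them back. The names FrequencyPaddingCollator, pad_rows, and to_tensor are hypothetical. The sketch assumes right-side padding, that each frequencies list is already aligned with the tokenized input_ids, and that 0.0 is an acceptable pad value.

```python
from typing import Any, Callable, Dict, List, Optional

FREQ_KEY = "frequencies"  # column name taken from the error message


def pad_rows(rows: List[List[float]], target_len: int,
             pad_value: float = 0.0) -> List[List[float]]:
    """Right-pad every row to target_len so the batch is rectangular."""
    padded = []
    for row in rows:
        if len(row) > target_len:
            # A longer row means frequencies are not aligned with input_ids.
            raise ValueError(
                f"row of length {len(row)} exceeds target length {target_len}"
            )
        padded.append(list(row) + [pad_value] * (target_len - len(row)))
    return padded


class FrequencyPaddingCollator:
    """Hypothetical wrapper: hides FREQ_KEY from the base collator,
    then pads it to the same length as the padded input_ids."""

    def __init__(self, base_collator: Callable[[List[Dict[str, Any]]], Any],
                 to_tensor: Optional[Callable[[List[List[float]]], Any]] = None,
                 pad_value: float = 0.0):
        self.base_collator = base_collator
        self.to_tensor = to_tensor
        self.pad_value = pad_value

    def __call__(self, features: List[Dict[str, Any]]):
        # Pull the extra column out so the base collator never sees it.
        freqs = [f[FREQ_KEY] for f in features]
        stripped = [{k: v for k, v in f.items() if k != FREQ_KEY}
                    for f in features]
        batch = self.base_collator(stripped)
        # Length of the first padded sequence (works for lists and tensors).
        target_len = len(batch["input_ids"][0])
        padded = pad_rows(freqs, target_len, self.pad_value)
        batch[FREQ_KEY] = self.to_tensor(padded) if self.to_tensor else padded
        return batch
```

In the training script this might be wired up as FrequencyPaddingCollator(DataCollatorForTokenClassification(tokenizer), to_tensor=lambda rows: torch.tensor(rows, dtype=torch.float)) and passed to the Trainer as data_collator. The model's forward method would also need to accept a frequencies argument for the padded values to reach it.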