Wonderful! Thanks to everyone for your assistance through all of this. I feel we are getting close to the end!
I updated to version 3.0.2 and ran the code, producing the following error:
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
Iteration: 0%| | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
File "/path/to/finetune_test.py", line 28, in <module>
trainer.train()
File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 492, in train
for step, inputs in enumerate(epoch_iterator):
File "/path/to/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
for obj in iterable:
File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 61, in default_data_collator
batch[k] = torch.stack([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [1, 149] at entry 0 and [1, 244] at entry 1
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
Iteration: 0%| | 0/16 [00:00<?, ?it/s]
I assume this has to do with padding and truncation in the GPT2Tokenizer (which I briefly posted about here). In my DataSet object, I'm guessing that I need to set the padding and truncation parameters in:

input_ids = self.tokenizer.encode(abstract_text, return_tensors='pt')
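For concreteness, here is roughly the `__getitem__` change I'm picturing. This is just a sketch: the `max_length` of 512 and the `self.texts` attribute are placeholders, not my actual code, and since GPT-2 has no pad token by default I assume I'd also need something like `tokenizer.pad_token = tokenizer.eos_token` when setting up the tokenizer.

```python
def __getitem__(self, idx):
    # Placeholder name -- swap in however the abstracts are actually stored.
    abstract_text = self.texts[idx]
    input_ids = self.tokenizer.encode(
        abstract_text,
        padding='max_length',  # pad shorter abstracts up to max_length
        truncation=True,       # cut longer abstracts down to max_length
        max_length=512,        # arbitrary choice, just for illustration
        return_tensors='pt',
    )
    # encode() returns shape [1, max_length]; squeezing the batch dim lets
    # default_data_collator stack examples into [batch_size, max_length]
    # instead of failing on unequal sizes like [1, 149] vs [1, 244].
    return {'input_ids': input_ids.squeeze(0)}
```

Does that look like the right direction, or should the padding be handled in a data collator instead of the Dataset?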