## Environment info
`transformers` version: 4.3.3
Platform: Linux-4.…19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
PyTorch version (GPU?): 1.7.1+cu101 (False)
Tensorflow version (GPU?): 2.4.1 (False)
Using GPU in script?: True/False
Using distributed or parallel set-up in script?: False
### Who can help
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- tokenizers: @LysandreJik
- trainer: @sgugger
## Information
Model I am using (Bert, XLNet ...): `Longformer`
The problem arises when using:
* [x] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)
The task I am working on is:
* [ ] an official GLUE/SQuAD task: (give the name)
* [x] my own task or dataset: (give details below)
## To reproduce
Hi, I am trying to train an LM on a custom dataset (which is simply text spread over multiple lines). My choice was Longformer, and I am using the official example code with just a few modifications.
When I fine-tune it on the custom dataset, I get this error:
```py
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-54-2f2d9c2c00fc> in <module>()
45 )
46
---> 47 train_results = trainer.train()
6 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1032 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
1033
-> 1034 for step, inputs in enumerate(epoch_iterator):
1035
1036 # Skip past any already trained steps if resuming training
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
555 def _next_data(self):
556 index = self._next_index() # may raise StopIteration
--> 557 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
558 if self._pin_memory:
559 data = _utils.pin_memory.pin_memory(data)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
<ipython-input-53-5e4959dcf50c> in __getitem__(self, idx)
7
8 def __getitem__(self, idx):
----> 9 item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
10 item['labels'] = torch.tensor(self.labels[idx])
11 return item
<ipython-input-53-5e4959dcf50c> in <dictcomp>(.0)
7
8 def __getitem__(self, idx):
----> 9 item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
10 item['labels'] = torch.tensor(self.labels[idx])
11 return item
IndexError: list index out of range
```
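For context, the dataset wrapper the traceback points at (cell `<ipython-input-53>`) is the standard one from the fine-tuning-with-custom-datasets tutorial. Roughly this sketch (the `__getitem__` matches the traceback; the class name, `__init__`, and `__len__` are written from memory):
```py
import torch

class CustomDataset(torch.utils.data.Dataset):
    # Standard wrapper from the custom-datasets tutorial; the class name is a placeholder.
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # This is the line raising the IndexError (line 9 in the traceback).
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
```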
It is most probably a tokenization problem, but I can't seem to locate it.
I made sure that the tokenizer in the **LM** accepts an appropriate length (even if it is quite a bit larger than I need):
`tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)`
For fine-tuning, I made sure it truncates and pads, though none of my data samples is long enough to be truncated:
```py
train_encodings = tokenizer(list(train_text), truncation=True, padding=True, max_length=3500)
val_encodings = tokenizer(list(val_text), truncation=True, padding=True, max_length=3500)  # same call on the validation split
```
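In case it helps narrow things down, this is the kind of check I can run on these encodings (a sketch; `train_labels` stands in for whatever label list I pass to the dataset wrapper):
```py
# If these two lengths differ, __getitem__ will eventually index past the
# shorter one and raise exactly the IndexError shown above.
print(len(train_encodings["input_ids"]), len(train_labels))
assert len(train_encodings["input_ids"]) == len(train_labels)
```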
Finally, I tried it with some dummy data of _fixed length_, like this:
```py
train_text = ['a', 'b']
val_text = ['c', 'd']
```
This rules out most tokenization errors.
I am fine-tuning in accordance with the official scripts, something I have done before. The LM looks good to me and its tokenizer works on individual samples as well, so I have no reason to suspect it.
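For reference, this is roughly how I checked the tokenizer on individual samples (a sketch; the sample string is just placeholder text, not real data):
```py
from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)
sample = "a short line similar to what NYA.txt contains"  # placeholder text
enc = tokenizer(sample, truncation=True, max_length=3500)
print(len(enc["input_ids"]))
print(tokenizer.decode(enc["input_ids"]))
```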
I am attaching my **LM** code:
```py
!pip install -q git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files='./NYA.txt', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
!mkdir ny_model
tokenizer.save_model("ny_model")
from transformers import LongformerConfig
config = LongformerConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=2,
    num_hidden_layers=1,
    type_vocab_size=1,
)
from transformers import LongformerTokenizerFast
tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)
from transformers import LongformerForMaskedLM
model = LongformerForMaskedLM(config=config)
%%time
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./NYA.txt",
    block_size=128,
)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=1e-5,
    logging_steps=50,
    fp16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)
trainer.train()
```
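A small check that can be run right after the `LineByLineTextDataset` cell, just to confirm the file was actually parsed into examples (a sketch):
```py
# Runs right after building the dataset above.
print(len(dataset))   # number of non-empty lines that became examples
print(dataset[0])     # a single example: a dict with an "input_ids" tensor
```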
As said above, the fine-tuning part is just like the official scripts, save for the tokenizer arguments and some simple training args.
I believe this code with a simple dummy dataset could reproduce the bug. I can provide further help on a gist if someone can create one for full reproducibility. If I have made some idiotic mistake, please don't hesitate to point it out.
> Any ideas what the problem might be?
Cheers