I’m currently working with BERT applied to time sequences. To test whether my inputs are in a format BERT can process, I’m running:
import torch
from transformers import BertConfig, BertForPreTraining

config = BertConfig(vocab_size=30003,
                    num_attention_heads=12,
                    num_hidden_layers=12)
model = BertForPreTraining(config)
outputs = model(torch.LongTensor(inputs["labels"][54581]).view(-1, 43))
I chose this vocab_size because my tokens range from 0 to 30000 and I added two special tokens, to which I assigned the ids 30002 and 30003, hence the size. I’m reshaping the input to (1, 43) since I’m just trying to predict a single sequence of length 43 (including the CLS and SEP tokens)…
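Since my special token ids sit right at the top of the vocab, I also ran a quick range check (the ids below are dummies standing in for my real 43-token sequence):

```python
# An embedding table of size vocab_size only accepts ids 0 .. vocab_size - 1.
# Dummy ids standing in for my real sequence, including both special tokens.
vocab_size = 30003
sequence = [30002, 12189, 16651, 30003, 16139]

out_of_range = [t for t in sequence if not 0 <= t < vocab_size]
print(out_of_range)  # any id printed here would fail the embedding lookup
```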
The input above is of the form:
tensor([30002, 12189, 12818, 13938, 15092, 15906, 16238, 16138, 15772, 15349,
15094, 15193, 15740, 16740, 18137, 19763, 21208, 21979, 21630, 19799,
16651, 30003, 14003, 13028, 12250, 11881, 12082, 12807, 13975, 15462,
17065, 18514, 19534, 19937, 19843, 19390, 18737, 18047, 17449, 16976,
16575, 16139, 30003])
which seems to be the format BERT recognizes. I also added a token -100 to represent [MASK], as suggested in the docs. Then the following error appears:
/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2041 # remove once script supports set_grad_enabled
2042 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2044
2045
IndexError: index out of range in self
The full traceback is quite long, so I’ve only included the last frame. I don’t understand what I’m supposed to do now. Can anyone give me a clue about what I can try here? Thanks a lot!
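EDIT: for completeness, this is roughly how I injected the -100 mask token before converting to a LongTensor (dummy ids, and the mask position is chosen arbitrarily for illustration):

```python
# Dummy ids standing in for my real 43-token sequence
seq = [30002, 12189, 12818, 13938, 16651, 30003]

# Replace one position with -100 to mark it as [MASK], as I understood
# from the docs; the result is then wrapped with torch.LongTensor as above.
masked = list(seq)
masked[2] = -100
print(masked)
```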