Hello all. I’m working with the MarkupLM transformer on HTML pages. These pages end up way over the max_length of 512 tokens, so I set return_overflowing_tokens=True, which gives me back the extra tokens after truncation. The problem is I end up with uneven tensors. For example, my first dataset item looks like:
input_ids torch.Size([57, 512])
token_type_ids torch.Size([57, 512])
attention_mask torch.Size([57, 512])
overflow_to_sample_mapping torch.Size([57])
xpath_tags_seq torch.Size([57, 512, 50])
xpath_subs_seq torch.Size([57, 512, 50])
labels torch.Size([57, 512])
whereas the second looks like:
input_ids torch.Size([7, 512])
token_type_ids torch.Size([7, 512])
attention_mask torch.Size([7, 512])
overflow_to_sample_mapping torch.Size([7])
xpath_tags_seq torch.Size([7, 512, 50])
xpath_subs_seq torch.Size([7, 512, 50])
labels torch.Size([7, 512])
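For reference, this is roughly how I’m producing those encodings (a minimal sketch; the checkpoint name and stride value are placeholders, and I’m leaving out how I attach the labels):

```python
from transformers import MarkupLMProcessor

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

html_string = "<html><body><h1>Example</h1><p>Some text</p></body></html>"

# Chunk one HTML page into 512-token windows; each window becomes one row
# in the returned tensors, so longer pages yield more rows (e.g. 57 vs. 7).
encoding = processor(
    html_string,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_overflowing_tokens=True,
    stride=0,
    return_tensors="pt",
)
print({k: v.shape for k, v in encoding.items()})
```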
Is there a way to automatically pad these tensors along the chunk dimension (the first one, 57 vs. 7) so they can be batched? I tried doing it manually with just two dataset items, verified they had the exact same shape, and passed them to the trainer, but ended up with:
RuntimeError: The size of tensor a (512) must match the size of tensor b (50) at non-singleton dimension 2
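In case it helps, this is roughly what I meant by padding manually: zero-padding every tensor along the chunk dimension up to a common size (a hypothetical helper I wrote for illustration; presumably the labels would really need to be padded with -100 rather than 0 so the loss ignores the filler):

```python
import torch

def pad_chunk_dim(encoding: dict, target_chunks: int) -> dict:
    """Zero-pad every tensor in an encoding along dim 0 (the chunk dimension)."""
    padded = {}
    for key, tensor in encoding.items():
        missing = target_chunks - tensor.shape[0]
        if missing > 0:
            # Filler keeps all trailing dims (512, or 512 x 50 for the xpath tensors).
            filler = torch.zeros((missing, *tensor.shape[1:]), dtype=tensor.dtype)
            tensor = torch.cat([tensor, filler], dim=0)
        padded[key] = tensor
    return padded
```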
Not sure if I’m going about this the correct way, but any guidance or thoughts would be appreciated. Thank you.