When using MarkupLMTokenizerFast there is an argument, return_overflowing_tokens. This makes sense, since the model can only handle 512 tokens at a time: for example, if my data has 1024 tokens, the tokenizer returns a tensor of size (2, 512).
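To illustrate what I mean, here is a rough sketch of the chunking behavior I'm describing (this is not the Hugging Face implementation, just my understanding of what return_overflowing_tokens=True does conceptually):

```python
# Sketch only: a sequence longer than max_length is split into
# max_length-sized chunks, so 1024 tokens with max_length=512
# yields 2 rows, i.e. a (2, 512) batch.

def chunk_tokens(token_ids, max_length=512):
    """Split token_ids into consecutive chunks of at most max_length."""
    return [token_ids[i:i + max_length]
            for i in range(0, len(token_ids), max_length)]

tokens = list(range(1024))          # stand-in for 1024 token ids
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]))  # 2 512
```

The second row here is what I'm calling the "overflowing" tokens.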
My question is: does the model consider the overflowing tokens during training? If so, can you point me to where it actually trains on that data? It's not clear to me that this is happening, and I want to make sure either way.