LayoutXLM training - index out of bounds: 0 <= tmp30 < 1L

I am getting an error during inference (the evaluation pass that runs before training), and after days of debugging I am out of ideas. Any help is appreciated, thank you!

I am training on the following setup:

Ubuntu 22.04
NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3
cuda 12.1
cudnn 9.1
datasets 2.15.0
transformers 4.36.2
torch 2.4

Task: document question answering on a custom dataset.

Model repo being trained:

copied from: microsoft/layoutxlm-base

I am getting this error during evaluation (prior to training), always with the same sample, as far as I can tell:

5%|▌         | 5901/109877 [56:31<16:12:41,  1.78it/s]terminate called after throwing an instance of 'c10::Error'
  what():  index out of bounds: 0 <= tmp30 < 1L
Exception raised from kernel at /tmp/torchinductor_aiteam/li/cliz2c63uoa3repoiaztoizrjecjxefsfbjltc6wzfp7p6brqesb.cpp:155 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f486a6d0f86 in /home/aiteam/miniconda3/envs/hf_layoutLM_test/lib/python3.9/site-packages/torch/lib/
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f486a67fdd9 in /home/aiteam/miniconda3/envs/hf_layoutLM_test/lib/python3.9/site-packages/torch/lib/
frame #2: <unknown function> + 0x432b (0x7f47aa16032b in /tmp/torchinductor_aiteam/li/
frame #3: <unknown function> + 0x16405 (0x7f48b92aa405 in /home/aiteam/miniconda3/envs/hf_layoutLM_test/lib/python3.9/site-packages/torch/lib/
frame #4: <unknown function> + 0x8609 (0x7f48ba6e4609 in /lib/x86_64-linux-gnu/
frame #5: clone + 0x43 (0x7f48ba4af353 in /lib/x86_64-linux-gnu/
terminate called recursively
C++ Traceback (most recent call last):
0   c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*)
Error Message Summary:
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1725357799 (unix time) try "date -d @1725357799" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x3ea0001d000) received by PID 118784 (TID 0x7f47ff7ab700) from PID 118784 ***]

I strongly suspect the problem is in the data, specifically in that particular sample.
However, I have been debugging for days and cannot see any systematic difference between that sample and any other (I am obviously missing something).
My features:
“input_ids” → in range [0, 250002], which is the tokenizer’s vocab
“attention_mask” → in {0, 1}
“start_positions” → 0 for this sample (the subfinder didn’t find the answer in the context)
“end_positions” → 0 for this sample
“bbox” → normalized, all in [0, 1000]
“image” → uint8, all in range [0, 255]
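
A minimal sketch of how the ranges above could be checked mechanically for a single sample. The helper names and the limits table are hypothetical. The upper bounds are exclusive and are assumptions that should be read from the model config, not taken from here. In particular, if the embedding table has 250002 rows, then the id 250002 itself would already be out of range. The message "0 <= tmp30 < 1L" reads like a bounds check on an index into a dimension of size 1, so it may be worth including any index-type feature that is not listed above. "token_type_ids" is used below purely as an illustration.

```python
def flatten(value):
    """Yield every scalar from a scalar or an arbitrarily nested list/tuple."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def find_out_of_range(sample, limits):
    """Return {feature: (min, max)} for each feature with a value outside [low, high)."""
    bad = {}
    for name, (low, high) in limits.items():
        if name not in sample:
            continue
        values = list(flatten(sample[name]))
        if not values:
            continue
        smallest, largest = min(values), max(values)
        if smallest < low or largest >= high:
            bad[name] = (smallest, largest)
    return bad


# Hypothetical limits (assumptions): replace with the sizes from the model config.
limits = {
    "input_ids": (0, 250002),   # assumed number of rows in the word embedding table
    "bbox": (0, 1001),          # assumed: coordinates 0..1000 are valid
    "token_type_ids": (0, 1),   # assumed: a table with a single row
}

# Made-up sample, only to show the shape of the report.
sample = {
    "input_ids": [0, 6, 250001, 2],
    "bbox": [[0, 0, 0, 0], [10, 20, 30, 40]],
    "token_type_ids": [0, 0, 1, 1],
}
print(find_out_of_range(sample, limits))
```

Running this over the failing sample and a few passing ones, with the real sizes filled in, should show whether any index feature differs between them.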

Some (naive) direct questions:

  1. Can the problem stem from the start and end positions being 0? I would not expect so; the token at that position also decodes properly to " ’ ’ "
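
For the question above, a small sanity check on the answer span might help rule the positions in or out. This is only a sketch: `check_answer_span` is a hypothetical helper, and it assumes that positions are token indices into the same sequence that `attention_mask` covers. Position 0 is commonly used as the "no answer" target in extractive QA, so a (0, 0) span passing this check would suggest looking elsewhere.

```python
def check_answer_span(start, end, attention_mask):
    """Return a list of human-readable problems with an answer span (empty if none)."""
    problems = []
    seq_len = len(attention_mask)
    if not 0 <= start < seq_len:
        problems.append(f"start {start} outside [0, {seq_len})")
    if not 0 <= end < seq_len:
        problems.append(f"end {end} outside [0, {seq_len})")
    if end < start:
        problems.append(f"end {end} precedes start {start}")
    # A span that lands on padding is inside the tensor but points at no real token.
    for name, pos in (("start", start), ("end", end)):
        if 0 <= pos < seq_len and attention_mask[pos] == 0:
            problems.append(f"{name} {pos} points at a padding token")
    return problems


# Made-up mask: three real tokens followed by one padding token.
mask = [1, 1, 1, 0]
print(check_answer_span(0, 0, mask))  # the "no answer" case from the question
print(check_answer_span(3, 5, mask))
```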

Any idea appreciated! Thank you in advance!