Inference just halts, no error, how to troubleshoot

Hello, I have a problem that I just don't know how to deal with.

I have a fine-tuned LayoutLM model for token classification. When I try to run prediction with model(input_ids=t_input_ids, bbox=t_bbox, attention_mask=t_attention_mask), the program just hangs. No error, no CPU or GPU usage.

I just don't know where to start troubleshooting this. Any suggestions?

Still trying to fix this problem. The model works fine on my computer, but when I run it in my Docker container it just halts. No resources used and no error. The input is the same in both cases.

The problem is I don't know how to troubleshoot this, since there is no error.

It might be an underlying CUDA installation error. Do commands like these work?

nvcc --version
>>> import torch
>>> t = torch.tensor([1,2,3]).to("cuda")
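A quick additional sanity check is whether PyTorch can see CUDA at all inside the container (these are standard PyTorch attributes):

```python
import torch

# If is_available() prints False inside the container, PyTorch is falling
# back to CPU (or is stuck waiting on a broken driver/runtime), and the
# problem is the CUDA setup rather than the model itself.
print(torch.__version__)          # PyTorch build version
print(torch.version.cuda)         # CUDA version PyTorch was built against (None on CPU-only builds)
print(torch.cuda.is_available())  # whether a usable GPU is visible
```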

Thanks for answering!

I'm running on the CPU on my development machine, so I don't have nvidia-smi or nvcc installed.


>>> import torch
>>> t = torch.tensor([1,2,3]).to("cpu")
>>> t
tensor([1, 2, 3])

Works fine…

My Dockerfile is:

FROM huggingface/transformers-pytorch-gpu
RUN apt-get update && apt-get install -y python3
COPY ./requirements.txt ./
RUN python3 -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
COPY ./ai .ai

I did try a basic transformer example and it works great inside the container.

I'm still working on this. After running debugpy in the Docker container, the program stops after this line:

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

The debugger stops and no error message is produced, even when trying to step into the line.
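One thing I can try is reproducing the embedding lookup directly, outside the model. torch.embedding is the low-level op that nn.Embedding dispatches to; the vocab size, hidden size, and token IDs below are placeholders, not my real inputs:

```python
import torch
import torch.nn.functional as F

# Stand-in embedding table: 30522 tokens, 768-dim vectors.
weight = torch.randn(30522, 768)
input_ids = torch.tensor([[101, 2023, 102]])

# F.embedding is the public wrapper around torch.embedding.
# If this call hangs or crashes with my real inputs, the problem
# is the input IDs rather than the rest of the model.
out = F.embedding(input_ids, weight)
print(out.shape)  # torch.Size([1, 3, 768])
```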

Does anyone have ideas?

OK, in that case it seems like an out-of-bounds issue with your tokens. One of these two might be causing it:

  • You added special tokens to the tokenizer but didn't resize the model's embeddings, so there's a token with an ID outside the embedding layer's range, causing an index error
  • Maybe you're masking parts of your input_ids with -100 and those are being fed through the embedding layer (negative indices can't be passed through an embedding layer)
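You can check for both in one pass, sketched here with a bare nn.Embedding standing in for the model's embedding layer. With a real Hugging Face model you'd compare against model.get_input_embeddings().num_embeddings, and fix the first case with model.resize_token_embeddings(len(tokenizer)):

```python
import torch

# Stand-in for the model's input embedding layer (vocab size 100 here).
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=8)

def find_bad_ids(input_ids: torch.Tensor, num_embeddings: int) -> torch.Tensor:
    """Return token IDs that would index outside the embedding table."""
    mask = (input_ids < 0) | (input_ids >= num_embeddings)
    return input_ids[mask]

ids = torch.tensor([1, 5, 150, -100])
bad = find_bad_ids(ids, embedding.num_embeddings)
print(bad)  # tensor([ 150, -100]) -- 150 is out of range, -100 is negative
```

If this comes back non-empty for your real input_ids, you've found the lookup that the debugger is dying on.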

Thanks, I'll check this out.