Here is the relevant code segment of my PyTorch model:
class MyModel(nn.Module):
    ...
    def forward(self, batch, logger):
        inputs = batch[0]
        structs = batch[1]
        # For debugging.
        if structs is None:
            logger.warning("structs is None detected!")
        # where structs is a dict object, like {"encoder_xx": torch.tensor(...),...}.
        encoder_structs = get_keys_by_prefix(structs, prefix="encoder", pop=False)
        ...
and here is the global helper function:
def get_keys_by_prefix(kwargs, prefix, pop=True):
    keys = [k for k in kwargs.keys() if prefix in k]
    rkwargs = {k: kwargs.pop(k) if pop else kwargs.get(k) for k in keys}
    return rkwargs
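For reference, a minimal sketch of how this helper behaves on a toy dict and on None. The toy values are plain integers rather than tensors, which is an assumption made only to keep the example self-contained:

```python
def get_keys_by_prefix(kwargs, prefix, pop=True):
    # Same helper as above, copied so the example runs on its own.
    keys = [k for k in kwargs.keys() if prefix in k]
    rkwargs = {k: kwargs.pop(k) if pop else kwargs.get(k) for k in keys}
    return rkwargs

# Toy stand-in for structs (hypothetical keys and values).
structs = {"encoder_a": 1, "decoder_b": 2}

# With pop=False the matching entries are copied and the input dict is left unchanged.
selected = get_keys_by_prefix(structs, prefix="encoder", pop=False)

# Passing None reproduces the AttributeError seen in the traceback.
try:
    get_keys_by_prefix(None, prefix="encoder", pop=False)
except AttributeError as exc:
    error_message = str(exc)
```

Note that the helper never rebinds or clears its kwargs argument, so by itself it cannot turn a dict into None.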
I use HuggingFace's Accelerate to run my model on two GPUs on a cluster node.
Training calls the model's 'forward' method. After a number of iterations (ibatch), as shown in the following log line,
INFO - __main__ - Train: iepoch 0, ibatch 861
the run failed with the following exception and traceback:
Traceback (most recent call last):
  ...
  File "/.../my_model.py", line 214, in forward
    encoder_structs = get_keys_by_prefix(structs, prefix="encoder", pop=False)
  File "/.../datautil.py", line 397, in get_keys_by_prefix
    keys = [k for k in kwargs.keys() if prefix in k]
AttributeError: 'NoneType' object has no attribute 'keys'
I did not see the logger.warning output from MyModel.forward. To me, this means that 'structs' was not None when it was passed into get_keys_by_prefix, but somehow became None inside get_keys_by_prefix.
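This inference holds only if the warning would actually have been visible for the process that failed, which depends on how the logger is configured in each process. A minimal sketch of a check that does not depend on any logger setup follows; require_structs, ibatch, and rank are hypothetical names used for illustration and are not part of the code above:

```python
def require_structs(structs, ibatch, rank):
    """Raise immediately if structs is None, regardless of logging configuration."""
    if structs is None:
        # An exception cannot be filtered out by log levels or handlers.
        raise ValueError(f"structs is None at ibatch={ibatch} on rank={rank}")
    return structs

# A dict passes through unchanged.
ok = require_structs({"encoder_xx": 1}, ibatch=0, rank=0)

# None raises with the batch index and process rank in the message.
try:
    require_structs(None, ibatch=861, rank=1)
except ValueError as exc:
    message = str(exc)
```

Calling such a check in forward just before get_keys_by_prefix would show whether structs is already None at that point and on which process.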
I ran the model several times, and the exception always occurred at the same iteration (iepoch 0, ibatch 861), so it is not random.
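Because the failure is reproducible, one possible way to narrow it down is to scan the batches without running the model and record which ones have None as their second element. This is only a sketch under assumptions: find_none_struct_batches and toy_batches are hypothetical, and a plain list stands in for the real data loader:

```python
def find_none_struct_batches(batches):
    """Return the indices of batches whose second element (structs) is None."""
    return [i for i, batch in enumerate(batches) if batch[1] is None]

# Toy stand-in for the batches yielded by a data loader (hypothetical data).
toy_batches = [
    ("inputs0", {"encoder_xx": 0}),
    ("inputs1", None),
    ("inputs2", {"encoder_xx": 2}),
]

bad_indices = find_none_struct_batches(toy_batches)
```

With two processes, each one sees a different share of the data, so the scan would need to run on the batches each process actually receives.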
I have been debugging this for a few days and still cannot find the cause. Any help would be appreciated.
Here are some of my system settings:
Linux: RHEL 7.9
Python: 3.8.10
torch: 1.11.0+cu113
accelerate: 0.9.0