How to train LayoutLMv2 on the sequence classification task in AWS SageMaker?

Hi,

In order to train a LayoutLMv2 model on the sequence classification task on AWS SageMaker (inspired by Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb by @nielsr), through a script running in a Hugging Face training Deep Learning Container (DLC), I need to import the class LayoutLMv2ForSequenceClassification, but the import generates an error.

Here is the code of my AWS SageMaker notebook (inspired by 01_getting_started_pytorch by @philschmid). It creates a Hugging Face estimator that launches the DLC and then runs the script LayoutLMForSequenceClassification.py. I have no problem with the code that defines the HF estimator; the failure occurs after the DLC starts, when the script imports the class LayoutLMv2ForSequenceClassification:

!pip install "sagemaker>=2.48.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade

import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(entry_point='LayoutLMForSequenceClassification.py',
                            source_dir='./scripts',
                            instance_type='ml.g4dn.4xlarge', #'ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.17.0',
                            pytorch_version='1.10.2',
                            py_version='py38',
                            )

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit()

Below I copy the content of the script LayoutLMForSequenceClassification.py that I tested, followed by the whole error message.

Command that gives the error

It looks like the problem comes from this line:
from transformers import LayoutLMv2ForSequenceClassification

Who can help me find the solution to this issue?
Thank you.

# source: https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

import os

# pyyaml & detectron2
os.system("pip install -q pyyaml==5.1")
os.system("python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'")

# tesseract
os.system('chmod 777 /tmp')
os.system('apt-get update -y')
os.system('apt-get install tesseract-ocr -y')
os.system('pip install -q pytesseract')

print(">>>>>>>>>>>>>>>> pyyaml, detectron2, tesseract-ocr and pytesseract installed!")

if __name__ == "__main__":
    
    import torch
    import pytesseract
    
    model_name = "microsoft/layoutlmv2-base-uncased"
    
    # feature extractor & tokenizer
    from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor
    feature_extractor = LayoutLMv2FeatureExtractor()
    tokenizer = LayoutLMv2Tokenizer.from_pretrained(model_name)
    processor = LayoutLMv2Processor(feature_extractor, tokenizer)
    
    print(">>>>>>>>>>>>>>>> tokenizer and processor downloaded!")

    from transformers import LayoutLMv2ForSequenceClassification
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = LayoutLMv2ForSequenceClassification.from_pretrained(model_name, num_labels=2)
    model.to(device)
    
    print(">>>>>>>>>>>>>>>> model downloaded!")

And here is the error message:

2022-06-23 12:47:16 Starting - Starting the training job...
2022-06-23 12:47:31 Starting - Preparing the instances for trainingProfilerReport-1655988435: InProgress
......
2022-06-23 12:48:39 Downloading - Downloading input data...
2022-06-23 12:49:14 Training - Downloading the training image..........................bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.8/site-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
2022-06-23 12:53:28,754 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-06-23 12:53:28,776 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-06-23 12:53:28,784 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-06-23 12:53:29,332 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {},
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-training-2022-06-23-12-47-15-619",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-2-342475165949/huggingface-pytorch-training-2022-06-23-12-47-15-619/source/sourcedir.tar.gz",
    "module_name": "test_LayoutLMv2ForSequenceClassification",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g4dn.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g4dn.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "test_LayoutLMv2ForSequenceClassification.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={}
SM_USER_ENTRY_POINT=test_LayoutLMv2ForSequenceClassification.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=test_LayoutLMv2ForSequenceClassification
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-2-342475165949/huggingface-pytorch-training-2022-06-23-12-47-15-619/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2022-06-23-12-47-15-619","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-2-342475165949/huggingface-pytorch-training-2022-06-23-12-47-15-619/source/sourcedir.tar.gz","module_name":"test_LayoutLMv2ForSequenceClassification","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"test_LayoutLMv2ForSequenceClassification.py"}
SM_USER_ARGS=[]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220512-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg
Invoking script with the following command:
/opt/conda/bin/python3.8 test_LayoutLMv2ForSequenceClassification.py
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

2022-06-23 12:53:35 Training - Training image download completed. Training in progress.

(...) # installation messages for pyyaml, detectron2, tesseract-ocr and pytesseract omitted here

>>>>>>>>>>>>>>>> pyyaml, detectron2, tesseract-ocr and pytesseract installed!
Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 226k/226k [00:00<00:00, 5.30MB/s]
Downloading:   0%|          | 0.00/707 [00:00<?, ?B/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 707/707 [00:00<00:00, 664kB/s]
>>>>>>>>>>>>>>>> tokenizer and processor downloaded!
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2777, in _get_module
return importlib.import_module("." + module_name, self.__name__)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 848, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/conda/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 48, in <module>
from detectron2.modeling import META_ARCH_REGISTRY
  File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/__init__.py", line 2, in <module>
from detectron2.layers import ShapeSpec
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 2, in <module>
from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm, CycleBatchNormList
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/batch_norm.py", line 4, in <module>
    from fvcore.nn.distributed import differentiable_all_reduce
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/__init__.py", line 4, in <module>
    from .focal_loss import (
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 52, in <module>
sigmoid_focal_loss_jit: "torch.jit.ScriptModule" = torch.jit.script(sigmoid_focal_loss)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
fn = torch._C._jit_script_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 838, in try_compile_fn
return torch.jit.script(fn, _rcb=rcb)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
fn = torch._C._jit_script_compile(
RuntimeError: 
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
         >>> loss.backward()
    """
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
            binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "test_LayoutLMv2ForSequenceClassification.py", line 30, in <module>
    from transformers import LayoutLMv2ForSequenceClassification
  File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2768, in __getattr__
    value = getattr(module, name)
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2767, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2779, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.layoutlmv2.modeling_layoutlmv2 because of the following error (look up to see its traceback):
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
         >>> loss.backward()
    """
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
            binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: 
 undefined value has_torch_function_variadic:   File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962          >>> loss.backward()     """     if has_torch_function_variadic(input, target, weight, pos_weight):        ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE         return handle_torch_function(             binary_cross_entropy_with_logits, 'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'   File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36     targets = targets.float()     p = torch.sigmoid(inputs)     ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE     p_t = p * targets + (1 - p) * (1 - targets)     loss = ce_loss * ((1 - p_t) ** gamma)  The above exception was the direct cause of the following exception: Traceback (most recent call last):   File "test_LayoutLMv2ForSequenceClassification.py", line 30, in <module> from transformers import LayoutLMv2ForSequenceClassification   File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist   File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2768, in __getattr__     value = getattr(module, name)   File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2767, in __getattr__ module = self._get_module(self._class_to_module[name])   File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2779, in _get_module raise RuntimeError( RuntimeError: Failed to import transformers.models.layoutlmv2.modeling_layoutlmv2 because of the following error (look up to see its traceback):"
Command "/opt/conda/bin/python3.8 test_LayoutLMv2ForSequenceClassification.py"
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2022-06-23 13:00:19 Uploading - Uploading generated training model
2022-06-23 13:00:19 Failed - Training job failed
ProfilerReport-1655988435: Stopping

@pierreguillou did you ever figure out what caused this? I’m 90% sure it’s related to detectron2, but when it happens for me, I know detectron2 is part of the Docker container…

I think this error occurs when running on a machine without the right combination of PyTorch and a GPU; not quite sure though.
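
If it helps narrow things down, a few print statements at the top of the training script would at least confirm what the container actually provides (just a diagnostic sketch, nothing DLC-specific):

import torch
import transformers

# versions that matter for the detectron2 / torch.jit interaction
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
try:
    import detectron2
    print("detectron2:", detectron2.__version__)
except ImportError:
    print("detectron2 is not installed")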

I am also having this issue. As @plamb suggests, it is likely due to some incompatibility between the installed detectron2 and the torch/CUDA versions. A pinning sketch follows below.
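
If it is a version mismatch, one thing worth trying is installing a detectron2 wheel prebuilt against the exact torch/CUDA pair the DLC ships, instead of building from the git main branch. A sketch, assuming the container has torch 1.10 with CUDA 11.3 (adjust the cu/torch segments of the URL if the container reports something else):

import os

# Hypothetical pin: detectron2 0.6 prebuilt for torch 1.10 / CUDA 11.3,
# installed from the official prebuilt wheel index.
os.system(
    "python -m pip install detectron2==0.6"
    " -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html"
)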

Note this GitHub issue thread for any SageMaker users: LayoutLMv2 training on sagemaker error: undefined value has_torch_function_variadic · Issue #17855 · huggingface/transformers · GitHub
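
One more observation: the failing frame in the traceback is /opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py, i.e., a torch function patched by SageMaker Debugger (smdebug) rather than stock PyTorch, and torch.jit.script chokes on the patched version. So a workaround worth trying is to keep smdebug out of the picture entirely. A sketch using standard SageMaker SDK options (whether this resolves this exact crash is an assumption on my part):

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='LayoutLMForSequenceClassification.py',
    source_dir='./scripts',
    instance_type='ml.g4dn.4xlarge',
    instance_count=1,
    role=role,  # the execution role defined earlier in the notebook
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    # turn off the Debugger/Profiler hooks so smdebug does not
    # monkey-patch torch before torch.jit.script runs
    debugger_hook_config=False,
    disable_profiler=True,
    # smdebug also honors this environment variable
    environment={'USE_SMDEBUG': '0'},
)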