How to train LayoutLMv2 on the Sequence Classification task in AWS Sagemaker?


To train a LayoutLMv2 model on the sequence classification task in AWS SageMaker (inspired by @nielsr's notebook Fine-tuning LayoutLMForSequenceClassification on RVL-CDIP.ipynb) through a script running in a Hugging Face training Deep Learning Container (DLC), I need to import the class LayoutLMv2ForSequenceClassification, but the import raises an error.

Here is the code of my AWS SageMaker notebook (inspired by @philschmid's 01_getting_started_pytorch). It defines a Hugging Face Estimator that pulls the DLC and then runs the script. I have no problem with the code that defines the estimator; the problem appears after the DLC is set up, when the script runs: the error comes from importing the class LayoutLMv2ForSequenceClassification.

!pip install "sagemaker>=2.48.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade

import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(entry_point='',
                            instance_type='ml.g4dn.4xlarge', #'ml.p3.2xlarge',

# starting the train job with our uploaded datasets as input

I copy here the content of the script I tested and the whole error message.

Command that gives an error

The problem seems to come from this command:
from transformers import LayoutLMv2ForSequenceClassification

Can anyone help me find a solution to this issue?
Thank you.

# source:

import os

# pyyaml & detectron 2
os.system("pip install -q pyyaml==5.1")
os.system("python -m pip install 'git+'")

# tesseract
os.system('chmod 777 /tmp')
os.system('apt-get update -y')
os.system('apt-get install tesseract-ocr -y')
os.system('pip install -q pytesseract')

print(">>>>>>>>>>>>>>>> pyyaml, detectron2, tesseract-ocr and pytesseract installed!")

if __name__ == "__main__":
    import torch
    import pytesseract
    model_name = "microsoft/layoutlmv2-base-uncased"
    # feature_extractor & tokenizer
    from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor
    feature_extractor = LayoutLMv2FeatureExtractor()
    tokenizer = LayoutLMv2Tokenizer.from_pretrained(model_name)
    processor = LayoutLMv2Processor(feature_extractor, tokenizer)
    print(">>>>>>>>>>>>>>>> tokenizer and processor downloaded!")

    from transformers import LayoutLMv2ForSequenceClassification
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = LayoutLMv2ForSequenceClassification.from_pretrained(model_name, num_labels=2)
    print(">>>>>>>>>>>>>>>> model downloaded!")
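One thing worth noting about the script above: os.system returns the command's exit status, but the script never checks it, so a failed pip or apt-get step would still be followed by the "installed!" message. Below is a minimal sketch of a fail-fast alternative based on subprocess.run with check=True. The helper name run_step is hypothetical, and this only makes setup failures visible; it does not by itself address the import error.

```python
import subprocess
import sys


def run_step(args):
    """Run one setup command; raise CalledProcessError if it exits non-zero."""
    # Echo the command so it is easy to find in the training job logs.
    print(">>> " + " ".join(args), flush=True)
    # check=True turns a non-zero exit status into an exception,
    # which stops the training script instead of continuing silently.
    return subprocess.run(args, check=True)
```

For example, run_step([sys.executable, "-m", "pip", "install", "-q", "pytesseract"]) could stand in for the corresponding os.system call, so that a failed install stops the job at that step.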

And here, the error message:

2022-06-23 12:47:16 Starting - Starting the training job...
2022-06-23 12:47:31 Starting - Preparing the instances for trainingProfilerReport-1655988435: InProgress
2022-06-23 12:48:39 Downloading - Downloading input data...
2022-06-23 12:49:14 Training - Downloading the training image..........................bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.8/site-packages/paramiko/ CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
2022-06-23 12:53:28,754 sagemaker-training-toolkit INFO     Imported framework
2022-06-23 12:53:28,776 INFO     Block until all host DNS lookups succeed.
2022-06-23 12:53:28,784 INFO     Invoking user training script.
2022-06-23 12:53:29,332 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "framework_module": "",
    "hosts": [
    "hyperparameters": {},
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-training-2022-06-23-12-47-15-619",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-2-342475165949/huggingface-pytorch-training-2022-06-23-12-47-15-619/source/sourcedir.tar.gz",
    "module_name": "test_LayoutLMv2ForSequenceClassification",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g4dn.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
        "instance_groups": [
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g4dn.xlarge",
                "hosts": [
        "network_interface_name": "eth0"
    "user_entry_point": ""
Environment variables:
Invoking script with the following command:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:

2022-06-23 12:53:35 Training - Training image download completed. Training in progress.

(...) # here, I did not copy all messages of installation (pyyaml, detectron2, tesseract-ocr and pytesseract)

>>>>>>>>>>>>>>>> pyyaml, detectron2, tesseract-ocr and pytesseract installed!
Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 5.30MB/s]
Downloading:   0%|          | 0.00/707 [00:00<?, ?B/s]
Downloading: 100%|██████████| 707/707 [00:00<00:00, 664kB/s]
>>>>>>>>>>>>>>>> tokenizer and processor downloaded!
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/transformers/", line 2777, in _get_module
return importlib.import_module("." + module_name, self.__name__)
  File "/opt/conda/lib/python3.8/importlib/", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 848, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/conda/lib/python3.8/site-packages/transformers/models/layoutlmv2/", line 48, in <module>
from detectron2.modeling import META_ARCH_REGISTRY
  File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/", line 2, in <module>
from detectron2.layers import ShapeSpec
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/", line 2, in <module>
from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm, CycleBatchNormList
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/", line 4, in <module>
    from fvcore.nn.distributed import differentiable_all_reduce
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/", line 4, in <module>
    from .focal_loss import (
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/", line 52, in <module>
sigmoid_focal_loss_jit: "torch.jit.ScriptModule" = torch.jit.script(sigmoid_focal_loss)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/", line 1310, in script
fn = torch._C._jit_script_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/", line 838, in try_compile_fn
return torch.jit.script(fn, _rcb=rcb)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/", line 1310, in script
fn = torch._C._jit_script_compile(
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/", line 2962
         >>> loss.backward()
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "", line 30, in <module>
from transformers import LayoutLMv2ForSequenceClassification
  File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist
  File "/opt/conda/lib/python3.8/site-packages/transformers/", line 2768, in __getattr__
    value = getattr(module, name)
  File "/opt/conda/lib/python3.8/site-packages/transformers/", line 2767, in __getattr__
module = self._get_module(self._class_to_module[name])
  File "/opt/conda/lib/python3.8/site-packages/transformers/", line 2779, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.layoutlmv2.modeling_layoutlmv2 because of the following error (look up to see its traceback):
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/", line 2962
         >>> loss.backward()
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: 
 undefined value has_torch_function_variadic:   File "/opt/conda/lib/python3.8/site-packages/torch/utils/", line 2962          >>> loss.backward()     """     if has_torch_function_variadic(input, target, weight, pos_weight):        ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE         return handle_torch_function(             binary_cross_entropy_with_logits, 'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'   File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/", line 36     targets = targets.float()     p = torch.sigmoid(inputs)     ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE     p_t = p * targets + (1 - p) * (1 - targets)     loss = ce_loss * ((1 - p_t) ** gamma)  The above exception was the direct cause of the following exception: Traceback (most recent call last):   File "", line 30, in <module> from transformers import LayoutLMv2ForSequenceClassification   File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist   File "/opt/conda/lib/python3.8/site-packages/transformers/", line 2768, in __getattr__     value = getattr(module, name)   File "/opt/conda/lib/python3.8/site-packages/transformers/", line 2767, in __getattr__ module = self._get_module(self._class_to_module[name])   File "/opt/conda/lib/python3.8/site-packages/transformers/", line 2779, in _get_module raise RuntimeError( RuntimeError: Failed to import transformers.models.layoutlmv2.modeling_layoutlmv2 because of the following error (look up to see its traceback):"
Command "/opt/conda/bin/python3.8"
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2022-06-23 13:00:19 Uploading - Uploading generated training model
2022-06-23 13:00:19 Failed - Training job failed
ProfilerReport-1655988435: Stopping


@pierreguillou did you ever figure out what caused this? I’m 90% sure it’s related to detectron2, but when it happens for me I know detectron2 is part of the Docker container…

I think this error occurs when running on a machine that lacks some combination of PyTorch and a GPU; I'm not quite sure, though.

I am also having this issue. As @plamb suggests, it is likely due to an incompatibility between the installed detectron2 and torch/CUDA versions.

Note this GitHub issue thread for any SageMaker users: LayoutLMv2 training on sagemaker error: undefined value has_torch_function_variadic · Issue #17855 · huggingface/transformers · GitHub
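
Since the replies point to a possible version mismatch, a first diagnostic step could be to print the installed package versions inside the training container before importing the model class. Below is a minimal sketch, assuming the pip distribution names torch, transformers, detectron2 and fvcore; the helper name installed_versions is hypothetical. It reads package metadata through importlib.metadata rather than importing the packages, so it still works when importing detectron2 itself fails.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to what the container uses.
for name, version in installed_versions(
    ["torch", "transformers", "detectron2", "fvcore"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

If torch itself imports cleanly, torch.__version__ and torch.version.cuda additionally report the PyTorch build and the CUDA version it was compiled against, which can then be compared with what the installed detectron2 build expects.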