Hi,
I want to train a LayoutLMv2 model on the sequence classification task on AWS SageMaker (inspired by the notebook Fine-tuning LayoutLMForSequenceClassification on RVL-CDIP.ipynb by @nielsr), using a script that runs in a Hugging Face training Deep Learning Container (DLC). To do so, I need to import the class LayoutLMv2ForSequenceClassification, but the import raises an error.
Here is the code of my AWS SageMaker notebook (inspired by 01_getting_started_pytorch by @philschmid). It defines a Hugging Face Estimator that starts the DLC and then runs the script LayoutLMForSequenceClassification.py. I have no problem with the code that defines the estimator; the problem appears after the DLC is set up, when the script is run: the error comes from importing the class LayoutLMv2ForSequenceClassification.
!pip install "sagemaker>=2.48.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade
import sagemaker
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace
huggingface_estimator = HuggingFace(
    entry_point='LayoutLMForSequenceClassification.py',
    source_dir='./scripts',
    instance_type='ml.g4dn.4xlarge',  # 'ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
)
# starting the training job (no input channels are passed here)
huggingface_estimator.fit()
Below I copy the content of the script LayoutLMForSequenceClassification.py that I tested, followed by the whole error message.
Command that gives an error
It looks like the problem comes from this command:
from transformers import LayoutLMv2ForSequenceClassification
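One way to narrow this down is to import each module named in the traceback on its own, starting from the innermost dependency, and record which one fails first. The sketch below is only an illustration: try_import and check_import_chain are hypothetical helper names, and the module list is an assumption read off the traceback in this post.

```python
import importlib


def try_import(module_name):
    """Import module_name and return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # JIT failures surface as RuntimeError, not ImportError
        # Keep the exception type and the first non-empty line of its message.
        lines = str(exc).strip().splitlines()
        summary = lines[0] if lines else ""
        return False, f"{type(exc).__name__}: {summary}"
    return True, str(getattr(module, "__version__", "unknown version"))


def check_import_chain(module_names):
    """Try each module in order and collect the outcome per module."""
    return {name: try_import(name) for name in module_names}


# Assumed order, innermost dependency first (taken from the traceback).
IMPORT_CHAIN = [
    "torch",
    "fvcore.nn",
    "detectron2.modeling",
    "transformers.models.layoutlmv2.modeling_layoutlmv2",
]

if __name__ == "__main__":
    for name, (ok, detail) in check_import_chain(IMPORT_CHAIN).items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")
```

Run inside the container, the first FAIL line should show whether the problem already appears when importing fvcore or detectron2 alone, before transformers is involved.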
Can anyone help me find a solution to this issue?
Thank you.
# source: https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb
import os
# pyyaml & detectron 2
os.system("pip install -q pyyaml==5.1")
os.system("python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'")
# tesseract
os.system('chmod 777 /tmp')
os.system('apt-get update -y')
os.system('apt-get install tesseract-ocr -y')
os.system('pip install -q pytesseract')
print(">>>>>>>>>>>>>>>> pyyaml, detectron2, tesseract-ocr and pytesseract installed!")
if __name__ == "__main__":
    import torch
    import pytesseract
    model_name = "microsoft/layoutlmv2-base-uncased"
    # feature extractor & tokenizer
    from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor
    feature_extractor = LayoutLMv2FeatureExtractor()
    tokenizer = LayoutLMv2Tokenizer.from_pretrained(model_name)
    processor = LayoutLMv2Processor(feature_extractor, tokenizer)
    print(">>>>>>>>>>>>>>>> tokenizer and processor downloaded!")
    from transformers import LayoutLMv2ForSequenceClassification
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = LayoutLMv2ForSequenceClassification.from_pretrained(model_name, num_labels=2)
    model.to(device)
    print(">>>>>>>>>>>>>>>> model downloaded!")
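A side note on the installation steps in the script: os.system only returns the exit status, so a failed install would go unnoticed and the "installed!" message would still be printed. A minimal sketch of the same steps with subprocess.run(check=True), which raises as soon as a command fails, could look like this. The helper names run_step, pip_install and install_dependencies are hypothetical, and I assume the container allows apt-get as in the script above.

```python
import subprocess
import sys


def run_step(command):
    """Run one setup command and raise CalledProcessError on a non-zero exit."""
    print(">>> " + " ".join(command), flush=True)
    subprocess.run(command, check=True)


def pip_install(*packages):
    """Install packages with the same interpreter that runs this script."""
    run_step([sys.executable, "-m", "pip", "install", "-q", *packages])


def install_dependencies():
    """Same steps as the os.system calls in the script, but failing loudly."""
    pip_install("pyyaml==5.1")
    pip_install("git+https://github.com/facebookresearch/detectron2.git")
    run_step(["chmod", "777", "/tmp"])
    run_step(["apt-get", "update", "-y"])
    run_step(["apt-get", "install", "tesseract-ocr", "-y"])
    pip_install("pytesseract")
```

Calling install_dependencies() at the top of the training script would stop the job at the first failing command instead of continuing to the imports.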
And here is the error message:
2022-06-23 12:47:16 Starting - Starting the training job...
2022-06-23 12:47:31 Starting - Preparing the instances for trainingProfilerReport-1655988435: InProgress
......
2022-06-23 12:48:39 Downloading - Downloading input data...
2022-06-23 12:49:14 Training - Downloading the training image..........................bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.8/site-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
"class": algorithms.Blowfish,
2022-06-23 12:53:28,754 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-06-23 12:53:28,776 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-06-23 12:53:28,784 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-06-23 12:53:29,332 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {},
"current_host": "algo-1",
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "huggingface-pytorch-training-2022-06-23-12-47-15-619",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-east-2-342475165949/huggingface-pytorch-training-2022-06-23-12-47-15-619/source/sourcedir.tar.gz",
"module_name": "test_LayoutLMv2ForSequenceClassification",
"network_interface_name": "eth0",
"num_cpus": 4,
"num_gpus": 1,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"current_instance_type": "ml.g4dn.xlarge",
"current_group_name": "homogeneousCluster",
"hosts": [
"algo-1"
],
"instance_groups": [
{
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.g4dn.xlarge",
"hosts": [
"algo-1"
]
}
],
"network_interface_name": "eth0"
},
"user_entry_point": "test_LayoutLMv2ForSequenceClassification.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={}
SM_USER_ENTRY_POINT=test_LayoutLMv2ForSequenceClassification.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=test_LayoutLMv2ForSequenceClassification
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-2-342475165949/huggingface-pytorch-training-2022-06-23-12-47-15-619/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2022-06-23-12-47-15-619","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-2-342475165949/huggingface-pytorch-training-2022-06-23-12-47-15-619/source/sourcedir.tar.gz","module_name":"test_LayoutLMv2ForSequenceClassification","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"test_LayoutLMv2ForSequenceClassification.py"}
SM_USER_ARGS=[]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220512-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg
Invoking script with the following command:
/opt/conda/bin/python3.8 test_LayoutLMv2ForSequenceClassification.py
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
2022-06-23 12:53:35 Training - Training image download completed. Training in progress.
(...) # here, I omitted the installation messages (pyyaml, detectron2, tesseract-ocr and pytesseract)
>>>>>>>>>>>>>>>> pyyaml, detectron2, tesseract-ocr and pytesseract installed!
Downloading: 0%| | 0.00/226k [00:00<?, ?B/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 5.30MB/s]
Downloading: 0%| | 0.00/707 [00:00<?, ?B/s]
Downloading: 100%|██████████| 707/707 [00:00<00:00, 664kB/s]
>>>>>>>>>>>>>>>> tokenizer and processor downloaded!
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2777, in _get_module
return importlib.import_module("." + module_name, self.__name__)
File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 848, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/conda/lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 48, in <module>
from detectron2.modeling import META_ARCH_REGISTRY
File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/__init__.py", line 2, in <module>
from detectron2.layers import ShapeSpec
File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 2, in <module>
from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm, CycleBatchNormList
File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/batch_norm.py", line 4, in <module>
from fvcore.nn.distributed import differentiable_all_reduce
File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/__init__.py", line 4, in <module>
from .focal_loss import (
File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 52, in <module>
sigmoid_focal_loss_jit: "torch.jit.ScriptModule" = torch.jit.script(sigmoid_focal_loss)
File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
fn = torch._C._jit_script_compile(
File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 838, in try_compile_fn
return torch.jit.script(fn, _rcb=rcb)
File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
fn = torch._C._jit_script_compile(
RuntimeError:
undefined value has_torch_function_variadic:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
>>> loss.backward()
"""
if has_torch_function_variadic(input, target, weight, pos_weight):
~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return handle_torch_function(
binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
targets = targets.float()
p = torch.sigmoid(inputs)
ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
p_t = p * targets + (1 - p) * (1 - targets)
loss = ce_loss * ((1 - p_t) ** gamma)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test_LayoutLMv2ForSequenceClassification.py", line 30, in <module>
from transformers import LayoutLMv2ForSequenceClassification
File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2768, in __getattr__
value = getattr(module, name)
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2767, in __getattr__
module = self._get_module(self._class_to_module[name])
File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2779, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.layoutlmv2.modeling_layoutlmv2 because of the following error (look up to see its traceback):
undefined value has_torch_function_variadic:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
>>> loss.backward()
"""
if has_torch_function_variadic(input, target, weight, pos_weight):
~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return handle_torch_function(
binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
targets = targets.float()
p = torch.sigmoid(inputs)
ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
p_t = p * targets + (1 - p) * (1 - targets)
loss = ce_loss * ((1 - p_t) ** gamma)
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError:
undefined value has_torch_function_variadic: File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962 >>> loss.backward() """ if has_torch_function_variadic(input, target, weight, pos_weight): ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE return handle_torch_function( binary_cross_entropy_with_logits, 'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss' File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36 targets = targets.float() p = torch.sigmoid(inputs) ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none") ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE p_t = p * targets + (1 - p) * (1 - targets) loss = ce_loss * ((1 - p_t) ** gamma) The above exception was the direct cause of the following exception: Traceback (most recent call last): File "test_LayoutLMv2ForSequenceClassification.py", line 30, in <module> from transformers import LayoutLMv2ForSequenceClassification File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2768, in __getattr__ value = getattr(module, name) File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2767, in __getattr__ module = self._get_module(self._class_to_module[name]) File "/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py", line 2779, in _get_module raise RuntimeError( RuntimeError: Failed to import transformers.models.layoutlmv2.modeling_layoutlmv2 because of the following error (look up to see its traceback):"
Command "/opt/conda/bin/python3.8 test_LayoutLMv2ForSequenceClassification.py"
2022-06-23 12:59:41,840 sagemaker-training-toolkit ERROR Encountered exit_code 1
2022-06-23 13:00:19 Uploading - Uploading generated training model
2022-06-23 13:00:19 Failed - Training job failed
ProfilerReport-1655988435: Stopping
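Since the traceback points at /opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py, one experiment that might be worth trying is to launch the same job with the SageMaker Debugger hook and the profiler turned off. This is an unverified guess, not a confirmed fix. The sketch below only assembles the estimator arguments: build_estimator_kwargs and launch are hypothetical helper names, while debugger_hook_config and disable_profiler are existing estimator parameters in the SageMaker Python SDK.

```python
def build_estimator_kwargs(role):
    """Collect HuggingFace estimator arguments with Debugger and profiler off."""
    return {
        "entry_point": "LayoutLMForSequenceClassification.py",
        "source_dir": "./scripts",
        "instance_type": "ml.g4dn.4xlarge",
        "instance_count": 1,
        "role": role,
        "transformers_version": "4.17.0",
        "pytorch_version": "1.10.2",
        "py_version": "py38",
        # Assumption: the smdebug hook is involved in the failing JIT compile.
        "debugger_hook_config": False,  # do not attach the Debugger hook
        "disable_profiler": True,  # do not start the profiler
    }


def launch(role):
    """Create the estimator and start the job (needs AWS credentials)."""
    # Imported lazily so the kwargs can be inspected without the SDK installed.
    from sagemaker.huggingface import HuggingFace

    estimator = HuggingFace(**build_estimator_kwargs(role))
    estimator.fit()
    return estimator
```

In the notebook this would replace the HuggingFace(...) call shown earlier, for example launch(role) with the role returned by sagemaker.get_execution_role().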