LayoutLMv3 on SageMaker

Hi everyone,
I spent two days and about 30 training job runs trying to deploy LayoutLMv3 to SageMaker. So far, very little success, which is why I am coming to the community for help.

Can you please help me find the cause of the issue? Here is the error I get right before the training job fails:

KeyError: 'layoutlmv3'

Here is a wider context around that error:

model_args.model_name_or_path: microsoft/layoutlmv3-large
model_args.config_name: None
[INFO|file_utils.py:2215] 2022-12-16 14:11:36,731 >> https://huggingface.co/microsoft/layoutlmv3-large/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpjakyi4na
Downloading:   0%|          | 0.00/857 [00:00<?, ?B/s]
Downloading: 100%|██████████| 857/857 [00:00<00:00, 872kB/s]
[INFO|file_utils.py:2219] 2022-12-16 14:11:37,111 >> storing https://huggingface.co/microsoft/layoutlmv3-large/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/8a0e5de726f43163aedf8dc08186e7ef3b2c706adc3a01a024a6427d09e4e3f0.9009c531534232ef27cf370ef50d8628b965e90eb385fd924a3a02fd9af07213
[INFO|file_utils.py:2227] 2022-12-16 14:11:37,111 >> creating metadata file for /root/.cache/huggingface/transformers/8a0e5de726f43163aedf8dc08186e7ef3b2c706adc3a01a024a6427d09e4e3f0.9009c531534232ef27cf370ef50d8628b965e90eb385fd924a3a02fd9af07213
[INFO|configuration_utils.py:648] 2022-12-16 14:11:37,111 >> loading configuration file https://huggingface.co/microsoft/layoutlmv3-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/8a0e5de726f43163aedf8dc08186e7ef3b2c706adc3a01a024a6427d09e4e3f0.9009c531534232ef27cf370ef50d8628b965e90eb385fd924a3a02fd9af07213
Traceback (most recent call last):
  File "run_ner.py", line 631, in <module>
main()
  File "run_ner.py", line 346, in main
config = AutoConfig.from_pretrained(
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 657, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 372, in __getitem__
raise KeyError(key)
KeyError: 'layoutlmv3'
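For context on what fails here: AutoConfig.from_pretrained reads model_type from the downloaded config.json and indexes CONFIG_MAPPING, which in older transformers releases has no 'layoutlmv3' entry. A minimal sketch of the mechanism, with a simplified stand-in for the real mapping (the real one holds many more model types):

```python
# Simplified stand-in for transformers' CONFIG_MAPPING in an older release
# (assumption for illustration; the real mapping is much larger).
CONFIG_MAPPING = {"layoutlm": "LayoutLMConfig", "layoutlmv2": "LayoutLMv2Config"}

config_dict = {"model_type": "layoutlmv3"}  # what config.json declares
try:
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
except KeyError as err:
    print(f"KeyError: {err}")  # KeyError: 'layoutlmv3'
```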

Here is the code that launches the training job:

import sagemaker
from sagemaker.huggingface import HuggingFace
import botocore
from datasets.filesystems import S3FileSystem

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
role_name = role.split('/')[-1]

sagemaker_session_bucket = 'public-layoutlm3-training-data'

git_config = {'repo': 'https://github.com/pavel-nesterov/diploma-transformers-for-layoutLM-with-load_from_disk.git','branch': 'pavel-save-to-disk'}
instance = "ml.g4dn.xlarge"


training_input_path = f's3://{sagemaker_session_bucket}'
test_input_path = f's3://{sagemaker_session_bucket}'

huggingface_estimator = HuggingFace(
    entry_point='run_ner.py',
    source_dir='./examples/pytorch/token-classification',
    instance_type=instance,
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    #use_spot_instances=True,
    #max_wait=60,
    #max_run=1200,
    hyperparameters = {'model_name_or_path':'microsoft/layoutlmv3-large',
                       'output_dir':'/opt/ml/model',
                       'train_file': '/opt/ml/input/data/train/train_split.json',
                       'validation_file': '/opt/ml/input/data/test/eval_split.json',
                       'do_train': True,
                      }

)

huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

Hi,

What’s your Transformers version? Looks like you’re using a version prior to when LayoutLMv3 was added.
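Going by the transformers release notes, LayoutLMv3 support landed in v4.20.0 (an assumption worth double-checking against the release notes), so any 4.17.x container will hit this KeyError. A quick sketch of a version gate you could run at the top of the training script:

```python
# Minimal version gate; assumes LayoutLMv3 landed in transformers v4.20.0
# (check the transformers release notes to confirm).
def supports_layoutlmv3(version: str) -> bool:
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (4, 20)

print(supports_layoutlmv3("4.17.0"))  # False
print(supports_layoutlmv3("4.24.0"))  # True
```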

Hi Niels,
I am installing transformers with:

! pip install transformers

But you just gave me several ideas for experiments:

  • upgrade transformers version
  • if that fails, keep the same transformers version but use LayoutLMv2 (not v3), since it was released earlier.

If I manage to solve it, I will post the result here (hoping it may help someone).
Thank you, Niels!

I tried bumping to the next supported version, but no success:
ValueError: Unsupported huggingface version: 4.18.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.4.2, 4.5.0, 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.4, 4.5, 4.6, 4.10, 4.11, 4.12, 4.17.

I also added a `pip install -U sagemaker` cell at the beginning of the notebook, as the error message suggested, but I still get the same error.

This is how I tried to upgrade it

git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.18.0'}
huggingface_estimator = HuggingFace(
    entry_point='run_ner.py',
    source_dir='./examples/pytorch/token-classification',
    instance_type=instance,
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.18.0',
    pytorch_version='1.10.2',
    py_version='py38',
    #use_spot_instances=True,
    #max_wait=60,
    #max_run=1200,
    hyperparameters = {...}
)
Hi Pavel

Upgrading the transformers version in the notebook won’t work, because your SageMaker training jobs run in a separate container on a separate EC2 instance and are not associated with your notebook at all. You just use the notebook to schedule the training job, that’s all.

To make sure you use the latest (or a specific) transformers version in your training job, you can extend the Deep Learning Container provided by AWS. I described how to do it in this blog post: Unlock the Latest Transformer Models with Amazon SageMaker | by Heiko Hotz | Dec, 2022 | Towards Data Science

Hope that helps.

Cheers
Heiko


Thank you, Heiko!
I am reading it now

Really appreciate your help @marshmellow77

Hi Heiko,
Thanks again for suggesting the direction. I am trying to implement it, but it has taken me a day already and I still haven’t managed to make it work.
The problem is the very last step, when I push the image to ECR: this step just hangs for 10 minutes and nothing happens.
I added some echo statements to isolate the problem, which helped me overcome some missing permissions in my SageMaker notebook instance. But I still can’t figure out what I’m doing wrong.

Here is the “echo”-ed and “chown”-ed code based on yours.

%%sh

# Specify a name and a tag
algorithm_name=huggingface-pytorch-inference-extended
tag=1.10.2-transformers4.24.0-gpu-py38-cu113-ubuntu20.04

echo "Logging algorithm name and tag"
echo "algorithm_name : $algorithm_name"
echo "tag : $tag"

account=$(aws sts get-caller-identity --query Account --output text)
echo "Logging account"
echo "account : $account"

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
echo "Logging region"
echo "region : $region"

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:${tag}"
echo "Logging fullname"
echo "fullname : $fullname"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
  echo "Logging creating ECR repository"
  aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Log into Docker
echo "Logging AWS ecr get-login-password"
sudo chown -R ec2-user:ec2-user /home/ec2-user/SageMaker/
aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com

# Pull the image from the ECR
echo "Logging docker pull"
docker pull 763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

echo "Logging docker build"
docker build --progress=plain -t ${algorithm_name} .
echo "Logging docker tag"
docker tag ${algorithm_name} ${fullname}

# Check the status of the docker push
echo "Logging docker push status"
docker push 43****************52.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference-extended:1.10.2-transformers4.24.0-gpu-py38-cu113-ubuntu20.04 || echo "The docker push command failed. Check the status of the command to see the reason for hanging."

Here is the output of the code up to the very last command (the “docker push”, because that is where it hangs and no output is printed):

Logging algorithm name and tag
algorithm_name : huggingface-pytorch-inference-extended
tag : 1.10.2-transformers4.24.0-gpu-py38-cu113-ubuntu20.04
Logging account
account : 439850772052
Logging region
region : eu-central-1
Logging fullname
fullname : 43***************52.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference-extended:1.10.2-transformers4.24.0-gpu-py38-cu113-ubuntu20.04
Logging AWS ecr get-login-password
Login Succeeded
Logging docker pull
1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04: 
Pulling from huggingface-pytorch-inference
Digest: sha256:17e776fd3295cc6dfee4e122618f5bab7ef04e87ed0490ce6b64722a60f03333
Status: Image is up to date for 763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
Logging docker build
Sending build context to Docker daemon  81.92kB
Step 1/2 : FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
 ---> 27b277000343
Step 2/2 : RUN pip install --upgrade 'transformers==4.24.0'
 ---> Using cache
 ---> 70295f109e07
Successfully built 70295f109e07
Successfully tagged huggingface-pytorch-inference-extended:latest
Logging docker tag
Logging docker push status

Oh, no
Actually, “oh, yes!”

I just waited for a little longer and I see the image in my ECR!

I will try to do the same for LayoutLMv3 now.

:man_dancing: :mirror_ball: :ballet_shoes:

The testing notebook worked with zero modifications (in the Sagemaker notebook instance with the Administrator IAM policy attached).

I am trying to do the same for training container now.

Done, I managed to push the training container to my ECR. Now I will try to run the LayoutLMv3 training with an updated image.

Thank you both for help! @nielsr @marshmellow77 :handshake:

@pavel-nesterov - it looks like you extended the inference DLC, is that right? Since you want to finetune the model you might want to use and extend the training DLC instead.

Hi Heiko,
At first, I ran the code exactly as in your article (and repo), extending the inference container. As a next step, I took the training container and extended it by modifying some of the code from the article. Here is my updated code (maybe it will help someone); there is an extra line with `sudo chown`.
Thank you once again for help @marshmellow77

%%writefile Dockerfile
FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
RUN pip install --upgrade 'transformers==4.24.0'

%cd ~/SageMaker

%%sh

# Specify a name and a tag
algorithm_name=huggingface-pytorch-training-extended
tag=1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

echo "Logging algorithm name and tag"
echo "algorithm_name : $algorithm_name"
echo "tag : $tag"

account=$(aws sts get-caller-identity --query Account --output text)
echo "Logging account"
echo "account : $account"

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
echo "Logging region"
echo "region : $region"

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:${tag}"
echo "Logging fullname"
echo "fullname : $fullname"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
  echo "Logging creating ECR repository"
  aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Log into Docker
echo "Logging AWS ecr get-login-password"
sudo chown -R ec2-user:ec2-user /home/ec2-user/SageMaker/
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.eu-central-1.amazonaws.com

# Pull the image from the ECR
echo "Logging docker pull"
docker pull 763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

echo "Logging docker build"
docker build --progress=plain -t ${algorithm_name} .
echo "Logging docker tag"
docker tag ${algorithm_name} ${fullname}

# Check the status of the docker push
echo "Logging docker push status"
docker push ${fullname}
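
Once the extended image is in ECR, the estimator can point at it directly via `image_uri` instead of the `transformers_version`/`pytorch_version`/`py_version` triple. A minimal sketch (the account id, name and tag below are placeholders; substitute whatever you actually pushed):

```python
# Build the image URI for the extended training container.
# account/name/tag are placeholder values (assumptions), not real ones.
account = "123456789012"
region = "eu-central-1"
algorithm_name = "huggingface-pytorch-training-extended"
tag = "1.10.2-transformers4.24.0-gpu-py38-cu113-ubuntu20.04"
image_uri = f"{account}.dkr.ecr.{region}.amazonaws.com/{algorithm_name}:{tag}"
print(image_uri)

# Untested sketch of the estimator with the custom image; when image_uri is
# given, the transformers_version/pytorch_version/py_version arguments are
# not needed:
# huggingface_estimator = HuggingFace(
#     entry_point='run_ner.py',
#     image_uri=image_uri,
#     instance_type='ml.g4dn.xlarge',
#     instance_count=1,
#     role=role,
# )
```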

The training has started, but there is another problem, which is out of the scope of this topic (something is wrong with the input data). I need to read one of @nielsr’s posts about preparing a dataset for LayoutLMv3.