Input data for LayoutLMv3 on Sagemaker

Hi everyone,

I am trying to fine-tune LayoutLMv3 on my own dataset using SageMaker. My problem is the input data that I supply to the training job: I cannot find an answer to what the right way to do this on SageMaker is. There are good articles on how to do it in Colab or locally (and it works locally), but I cannot make it work with a training job.

This is the error I’m getting:

[INFO|modeling_utils.py:2608] 2022-12-22 09:30:07,941 >> All model checkpoint weights were used when initializing LayoutLMv3ForTokenClassification.
[WARNING|modeling_utils.py:2610] 2022-12-22 09:30:07,941 >> Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-large and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Traceback (most recent call last):
  File "run_ner.py", line 631, in <module>
    main()
  File "run_ner.py", line 412, in main
    if label.startswith("B-") and label.replace("B-", "I-") in label_list:
AttributeError: 'int' object has no attribute 'startswith'
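For context, here is a minimal reproduction of what I think is happening (my assumption: this mirrors the check run_ner.py makes at the line in the traceback). `.startswith()` exists on `str`, not on `int`, so the integer labels in my JSON trigger exactly this error:

```python
# Minimal reproduction (assumption: mirrors the failing check in run_ner.py).
# .startswith() is a str method, so an integer label raises AttributeError.
label = 2  # an integer label, as in my "labels" field below

try:
    label.startswith("B-")
except AttributeError as err:
    print(err)  # 'int' object has no attribute 'startswith'
```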

This is how I start the training job:

training_input_path = f's3://{sagemaker_session_bucket}'
test_input_path = f's3://{sagemaker_session_bucket}'

huggingface_estimator = HuggingFace(
    entry_point='run_ner.py',
    image_uri = '43**********52.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-training-extended:1.10.2-transformers4.24.0-gpu-py38-cu113-ubuntu20.04',
    source_dir='./examples/pytorch/token-classification',
    instance_type=instance,
    instance_count=1,
    role=role,
    git_config=git_config,
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters = {'model_name_or_path': 'microsoft/layoutlmv3-large',
                       'output_dir': '/opt/ml/model',
                       'train_file': '/opt/ml/input/data/train/train_split.json',
                       'validation_file': '/opt/ml/input/data/test/eval_split.json',
                       'do_train': True},
)
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

And this is the full first line from the “train_file”:

{"pixel_values":-0.6000000238,"input_ids":[0,289,21600,2076,5758,534,13565,16712,211,4,673,4,673,4,229,3917,21978,2348,740,17838,112,316,1244,10920,1417,2426,4576,992,102,27932,846,4832,17982,466,17573,2022,2036,18069,12334,21958,769,329,479,16041,571,4236,14285,338,5107,1589,992,523,11273,4236,8384,1176,17971,992,7413,479,3675,571,5981,1728,992,102,14091,112,6,40935,14091,1577,14091,3023,12334,4,37032,4,29,625,4,179,992,523,479,4856,415,4,330,783,1322,992,102,14091,321,6,14586,12334,4,37032,4,29,625,4,179,992,523,479,9371,1168,2156,13102,19866,7975,4193,12334,8434,4,104,1180,705,4,288,6,40670,4236,8384,661,4,2478,449,1517,21594,4862,5410,12334,181,4654,11273,4955,2133,475,12999,39506,4,4567,337,267,4,17655,821,12334,4236,8384,1176,22079,2497,35000,4,705,11491,329,4,2478,14091,321,6,37319,14091,3023,12334,4,37032,4,29,625,4,179,992,523,479,23879,4,330,16160,2463,38679,571,229,4550,853,449,12597,3144,1764,571,229,4550,853,449,12597,3144,1764,571,28451,4,267,2154,479,112,6,246,7606,10775,571,132,3023,17938,4,27076,479,11061,571,13451,4987,4,4654,90,2070,4955,2348,727,571,13451,4987,4,4654,90,2070,4955,2348,727,571,21299,1243,4840,594,2348,21299,1243,4840,594,2348,23634,1113,7458,19670,479,5773,571,155,3023,17971,1589,14848,2001,479,1878,571,468,241,4955,479,992,102,5278,594,118,40296,229,20670,3330,571,3713,7399,26771,10529,571,17971,1589,14848,2001,479,1878,571,10394,4203,4654,282,1910,542,1417,4955,1459,705,111,30593,2403,1589,14848,2001,479,1878,571,3713,7399,26771,10529,571,5811,329,330,705,4,32059,4,26976,4,4017,571,12334,449,230,7975,4193,17925,313,13571,4699,732,658,102,732,4489,316,571,221,7085,4,642,763,4654,1951,195,6,176,9043,7279,25113,111,10367,4832,381,9822,111,10367,4832,381,9822,111,1608,1343,4832,381,11615,8526,111,440,4832,971,4,1225,4,24837,2393,119,111,10367,4832,2393,1178,479,1608,1343,111,230,3999,4832,46423,479,8302,4832,6208,1343,111,10367,5480,111,381,11615,10353,4832,112,6,1646,10353,1589,14091,83,361,6,245,7606,5008,139,163,820,6,288,7606,50
08,139,112,6,4563,10353,1589,14091,654,6,288,7606,289,21600,2076,230,28737,2908,21726,10273,479,30415,7831,347,11250,132,6,5220,10353,1589,14091,321,6,3414,132,6,2831,654,6,288,7606,10353,10353,2908,6,2890,27932,846,501,6,3416,27932,846,132,6,3414,83,321,6,2831,83,155,6,2831,83,112,6,1922,83,321,6,3933,163,321,6,5606,83,321,6,3933,163,2],"attention_mask":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],"bbox":0,"labels":[-100,2,-100,-100,2,-100,-100,-100,2,-100,-100,-100,-100,-100,2,-100,-100,-100,2,-100,2,2,-100,2,-100,-100,2,2,-100,2,-100,2,2,-100,-100,-100,-100,2,16,11,11,-100,11,0,-100,16,16,-100,-100,11,11,-100,-100,0,0,-100,16,11,11,11,0,-100,16,-100,11,-100,0,14,-100,-100,24,2,24,2,16,-100,-100,-100,-100,-100,-100,-100,11,-100,0,16,-100,-100,-100,-100,-100,11,-100,0,14,-100,-100,16,-100,-100,-100,-100,-100,-100,-100,11,-100,0,16,-100,11,0,-100,8,-100,16,0,-100,-100,-100,-100,-100,-100,-100,-100,16,16,-100,-100,-100,0,-100,-100,-100,-100,16,11,-100,-100,-100,-100,0,-100,16,-100,-100,-100,-100,-100,-100
,0,16,11,11,-100,11,0,16,-100,-100,11,-100,-100,-100,0,14,-100,-100,24,2,16,-100,-100,-100,-100,-100,-100,-100,11,-100,0,16,-100,-100,-100,-100,0,-100,16,-100,-100,11,-100,-100,0,-100,16,-100,-100,11,-100,-100,0,-100,16,-100,-100,-100,11,11,-100,-100,11,0,-100,14,2,16,-100,-100,11,0,-100,16,-100,-100,-100,-100,-100,-100,-100,0,-100,16,-100,-100,-100,-100,-100,-100,-100,0,-100,16,-100,0,-100,-100,16,-100,0,-100,-100,16,-100,11,-100,11,0,-100,14,2,16,11,11,-100,11,0,-100,16,-100,-100,11,11,-100,11,-100,-100,0,16,-100,0,-100,16,-100,-100,0,-100,16,11,11,-100,11,0,-100,2,-100,-100,-100,-100,2,-100,-100,-100,-100,19,-100,-100,10,10,-100,10,20,-100,16,-100,-100,0,-100,12,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,12,2,12,8,-100,16,0,-100,-100,16,-100,-100,11,-100,0,-100,16,-100,-100,-100,-100,-100,-100,0,-100,-100,-100,2,-100,2,2,2,2,-100,2,2,2,2,-100,2,2,-100,2,2,-100,2,2,2,2,2,-100,-100,-100,-100,2,-100,2,2,2,2,-100,2,2,-100,2,2,-100,2,2,2,2,2,2,-100,2,2,2,2,2,-100,2,2,26,-100,-100,2,2,2,2,2,-100,-100,2,2,-100,2,2,-100,-100,2,2,-100,26,-100,-100,2,2,2,1,-100,-100,27,2,-100,-100,2,-100,2,2,-100,2,2,-100,-100,-100,26,-100,-100,2,2,2,26,-100,-100,26,-100,-100,1,-100,-100,27,2,2,2,-100,-100,2,-100,2,-100,-100,2,-100,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,-100]}

Apparently, I need to pass string labels with “B-” and “I-” prefixes in the “labels” field (right now I’m passing integers), but I cannot find examples of the right input for the SageMaker job.
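One workaround I sketched (with placeholder tag names, since I’m not sure of the exact label list my data was encoded with) is converting the integer class ids back to string tags before writing the JSON:

```python
# Sketch: map integer class ids back to string tags before writing
# train_split.json. The label list here is a placeholder / assumption --
# it must match the mapping used when the data was originally encoded.
label_list = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION",
              "B-ANSWER", "I-ANSWER"]
id2label = dict(enumerate(label_list))

def ids_to_tags(label_ids):
    # Keep -100 (tokens ignored by the loss) untouched; map everything else.
    return [lid if lid == -100 else id2label[lid] for lid in label_ids]

print(ids_to_tags([-100, 2, -100, 0, 1]))  # [-100, 'I-HEADER', -100, 'O', 'B-HEADER']
```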

Can you please guide me in finding an answer to this showstopper?

What are these hyperparameters for? Are they the paths to the training data?

'train_file': '/opt/ml/input/data/train/train_split.json',
'validation_file': '/opt/ml/input/data/test/eval_split.json',

If so, they conflict with the channel paths you passed to the estimator's fit() call.

huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

Typically, you would pass the training and test data paths via the estimator's fit() call. In your training script, you would then pick those values up from the command line like so:

import argparse
import os

def parse_args():
  parser = argparse.ArgumentParser()

  # SageMaker exports each fit() channel as SM_CHANNEL_<NAME> (uppercased)
  parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
  parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))

  return parser.parse_known_args()
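With those in place, your script can build the full file paths from the channel directories instead of hard-coding them in the hyperparameters (a sketch; the file names are taken from your hyperparameters above):

```python
import os

# Each fit() channel is downloaded inside the container to
# /opt/ml/input/data/<channel>, and SageMaker exports that path as
# SM_CHANNEL_<CHANNEL>. The defaults below assume the 'train'/'test'
# channel names from the fit() call above.
train_dir = os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train')
test_dir = os.environ.get('SM_CHANNEL_TEST', '/opt/ml/input/data/test')

# File names are taken from the hyperparameters in the question.
train_file = os.path.join(train_dir, 'train_split.json')
validation_file = os.path.join(test_dir, 'eval_split.json')
print(train_file)  # e.g. /opt/ml/input/data/train/train_split.json
```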