Input data for LayoutLMv3 on Sagemaker

Hi everyone,

I am trying to fine-tune LayoutLMv3 on my own dataset using SageMaker. My problem is the input data that I supply to the training job: I cannot find an answer to what the right way to do this on SageMaker is. There are good articles on how to do it in Colab or locally (and it works locally), but I cannot make it work with a training job.

This is the error I’m getting:

[INFO|modeling_utils.py:2608] 2022-12-22 09:30:07,941 >> All model checkpoint weights were used when initializing LayoutLMv3ForTokenClassification.
[WARNING|modeling_utils.py:2610] 2022-12-22 09:30:07,941 >> Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-large and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Traceback (most recent call last):
  File "run_ner.py", line 631, in <module>
    main()
  File "run_ner.py", line 412, in main
    if label.startswith("B-") and label.replace("B-", "I-") in label_list:
AttributeError: 'int' object has no attribute 'startswith'
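For context, here is a minimal reproduction of what I think is happening (my assumption: this mirrors the check run_ner.py makes at the line in the traceback). `.startswith()` exists on `str`, not on `int`, so the integer labels in my JSON trigger exactly this error:

```python
# Minimal reproduction (assumption: mirrors the failing check in run_ner.py).
# .startswith() is a str method, so an integer label raises AttributeError.
label = 2  # an integer label, as in my "labels" field below

try:
    label.startswith("B-")
except AttributeError as err:
    print(err)  # 'int' object has no attribute 'startswith'
```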

This is how I start the training job:

training_input_path = f's3://{sagemaker_session_bucket}'
test_input_path = f's3://{sagemaker_session_bucket}'

huggingface_estimator = HuggingFace(
    entry_point='run_ner.py',
    image_uri = '43**********52.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-training-extended:1.10.2-transformers4.24.0-gpu-py38-cu113-ubuntu20.04',
    source_dir='./examples/pytorch/token-classification',
    instance_type=instance,
    instance_count=1,
    role=role,
    git_config=git_config,
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters = {'model_name_or_path': 'microsoft/layoutlmv3-large',
                       'output_dir': '/opt/ml/model',
                       'train_file': '/opt/ml/input/data/train/train_split.json',
                       'validation_file': '/opt/ml/input/data/test/eval_split.json',
                       'do_train': True},
)
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

And this is the full first line from the “train_file”:

{"pixel_values":-0.6000000238,"input_ids":[0,289,21600,2076,5758,534,13565,16712,211,4,673,4,673,4,229,3917,21978,2348,740,17838,112,316,1244,10920,1417,2426,4576,992,102,27932,846,4832,17982,466,17573,2022,2036,18069,12334,21958,769,329,479,16041,571,4236,14285,338,5107,1589,992,523,11273,4236,8384,1176,17971,992,7413,479,3675,571,5981,1728,992,102,14091,112,6,40935,14091,1577,14091,3023,12334,4,37032,4,29,625,4,179,992,523,479,4856,415,4,330,783,1322,992,102,14091,321,6,14586,12334,4,37032,4,29,625,4,179,992,523,479,9371,1168,2156,13102,19866,7975,4193,12334,8434,4,104,1180,705,4,288,6,40670,4236,8384,661,4,2478,449,1517,21594,4862,5410,12334,181,4654,11273,4955,2133,475,12999,39506,4,4567,337,267,4,17655,821,12334,4236,8384,1176,22079,2497,35000,4,705,11491,329,4,2478,14091,321,6,37319,14091,3023,12334,4,37032,4,29,625,4,179,992,523,479,23879,4,330,16160,2463,38679,571,229,4550,853,449,12597,3144,1764,571,229,4550,853,449,12597,3144,1764,571,28451,4,267,2154,479,112,6,246,7606,10775,571,132,3023,17938,4,27076,479,11061,571,13451,4987,4,4654,90,2070,4955,2348,727,571,13451,4987,4,4654,90,2070,4955,2348,727,571,21299,1243,4840,594,2348,21299,1243,4840,594,2348,23634,1113,7458,19670,479,5773,571,155,3023,17971,1589,14848,2001,479,1878,571,468,241,4955,479,992,102,5278,594,118,40296,229,20670,3330,571,3713,7399,26771,10529,571,17971,1589,14848,2001,479,1878,571,10394,4203,4654,282,1910,542,1417,4955,1459,705,111,30593,2403,1589,14848,2001,479,1878,571,3713,7399,26771,10529,571,5811,329,330,705,4,32059,4,26976,4,4017,571,12334,449,230,7975,4193,17925,313,13571,4699,732,658,102,732,4489,316,571,221,7085,4,642,763,4654,1951,195,6,176,9043,7279,25113,111,10367,4832,381,9822,111,10367,4832,381,9822,111,1608,1343,4832,381,11615,8526,111,440,4832,971,4,1225,4,24837,2393,119,111,10367,4832,2393,1178,479,1608,1343,111,230,3999,4832,46423,479,8302,4832,6208,1343,111,10367,5480,111,381,11615,10353,4832,112,6,1646,10353,1589,14091,83,361,6,245,7606,5008,139,163,820,6,288,7606,50
08,139,112,6,4563,10353,1589,14091,654,6,288,7606,289,21600,2076,230,28737,2908,21726,10273,479,30415,7831,347,11250,132,6,5220,10353,1589,14091,321,6,3414,132,6,2831,654,6,288,7606,10353,10353,2908,6,2890,27932,846,501,6,3416,27932,846,132,6,3414,83,321,6,2831,83,155,6,2831,83,112,6,1922,83,321,6,3933,163,321,6,5606,83,321,6,3933,163,2],"attention_mask":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],"bbox":0,"labels":[-100,2,-100,-100,2,-100,-100,-100,2,-100,-100,-100,-100,-100,2,-100,-100,-100,2,-100,2,2,-100,2,-100,-100,2,2,-100,2,-100,2,2,-100,-100,-100,-100,2,16,11,11,-100,11,0,-100,16,16,-100,-100,11,11,-100,-100,0,0,-100,16,11,11,11,0,-100,16,-100,11,-100,0,14,-100,-100,24,2,24,2,16,-100,-100,-100,-100,-100,-100,-100,11,-100,0,16,-100,-100,-100,-100,-100,11,-100,0,14,-100,-100,16,-100,-100,-100,-100,-100,-100,-100,11,-100,0,16,-100,11,0,-100,8,-100,16,0,-100,-100,-100,-100,-100,-100,-100,-100,16,16,-100,-100,-100,0,-100,-100,-100,-100,16,11,-100,-100,-100,-100,0,-100,16,-100,-100,-100,-100,-100,-100
,0,16,11,11,-100,11,0,16,-100,-100,11,-100,-100,-100,0,14,-100,-100,24,2,16,-100,-100,-100,-100,-100,-100,-100,11,-100,0,16,-100,-100,-100,-100,0,-100,16,-100,-100,11,-100,-100,0,-100,16,-100,-100,11,-100,-100,0,-100,16,-100,-100,-100,11,11,-100,-100,11,0,-100,14,2,16,-100,-100,11,0,-100,16,-100,-100,-100,-100,-100,-100,-100,0,-100,16,-100,-100,-100,-100,-100,-100,-100,0,-100,16,-100,0,-100,-100,16,-100,0,-100,-100,16,-100,11,-100,11,0,-100,14,2,16,11,11,-100,11,0,-100,16,-100,-100,11,11,-100,11,-100,-100,0,16,-100,0,-100,16,-100,-100,0,-100,16,11,11,-100,11,0,-100,2,-100,-100,-100,-100,2,-100,-100,-100,-100,19,-100,-100,10,10,-100,10,20,-100,16,-100,-100,0,-100,12,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,12,2,12,8,-100,16,0,-100,-100,16,-100,-100,11,-100,0,-100,16,-100,-100,-100,-100,-100,-100,0,-100,-100,-100,2,-100,2,2,2,2,-100,2,2,2,2,-100,2,2,-100,2,2,-100,2,2,2,2,2,-100,-100,-100,-100,2,-100,2,2,2,2,-100,2,2,-100,2,2,-100,2,2,2,2,2,2,-100,2,2,2,2,2,-100,2,2,26,-100,-100,2,2,2,2,2,-100,-100,2,2,-100,2,2,-100,-100,2,2,-100,26,-100,-100,2,2,2,1,-100,-100,27,2,-100,-100,2,-100,2,2,-100,2,2,-100,-100,-100,26,-100,-100,2,2,2,26,-100,-100,26,-100,-100,1,-100,-100,27,2,2,2,-100,-100,2,-100,2,-100,-100,2,-100,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,23,-100,-100,2,-100]}

Apparently, I need to pass string labels with “B-” and “I-” prefixes in the “labels” field (right now I’m passing integers), but I cannot find examples of the right input for the SageMaker job.
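One workaround I sketched (with placeholder tag names, since I’m not sure of the exact label list my data was encoded with) is converting the integer class ids back to string tags before writing the JSON:

```python
# Sketch: map integer class ids back to string tags before writing
# train_split.json. The label list here is a placeholder / assumption --
# it must match the mapping used when the data was originally encoded.
label_list = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION",
              "B-ANSWER", "I-ANSWER"]
id2label = dict(enumerate(label_list))

def ids_to_tags(label_ids):
    # Keep -100 (tokens ignored by the loss) untouched; map everything else.
    return [lid if lid == -100 else id2label[lid] for lid in label_ids]

print(ids_to_tags([-100, 2, -100, 0, 1]))  # [-100, 'I-HEADER', -100, 'O', 'B-HEADER']
```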

Can you please guide me in finding an answer to this showstopper?

What are these hyperparameters for? Are they the paths to the training data?

'train_file': '/opt/ml/input/data/train/train_split.json',
'validation_file': '/opt/ml/input/data/test/eval_split.json',

If so, they conflict with the channel paths you passed to the estimator's fit() call.

huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

Typically, you would pass the training and test data paths via the estimator's fit() call. In your training script, you would then pick those values up from the command line like so:

import argparse
import os

def parse_args():
  parser = argparse.ArgumentParser()

  # SageMaker exports each fit() channel as SM_CHANNEL_<NAME> (uppercased)
  parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
  parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))

  return parser.parse_known_args()
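With those in place, your script can build the full file paths from the channel directories instead of hard-coding them in the hyperparameters (a sketch; the file names are taken from your hyperparameters above):

```python
import os

# Each fit() channel is downloaded inside the container to
# /opt/ml/input/data/<channel>, and SageMaker exports that path as
# SM_CHANNEL_<CHANNEL>. The defaults below assume the 'train'/'test'
# channel names from the fit() call above.
train_dir = os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train')
test_dir = os.environ.get('SM_CHANNEL_TEST', '/opt/ml/input/data/test')

# File names are taken from the hyperparameters in the question.
train_file = os.path.join(train_dir, 'train_split.json')
validation_file = os.path.join(test_dir, 'eval_split.json')
print(train_file)  # e.g. /opt/ml/input/data/train/train_split.json
```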