NER on SageMaker Run run_ner.py

Hello @philschmid, I hope you are doing well. A question for you: do you have an example of the expected data format for using this script (run_ner.py) in SageMaker training?

Thanks,

Jorge

Hey,

You can always find the data format for each of the examples/ scripts inside the script itself. For run_ner.py it is here: transformers/run_ner.py at b518aaf193938247f698a7c4522afe42b025225a · huggingface/transformers · GitHub

    if data_args.text_column_name is not None:
        text_column_name = data_args.text_column_name
    elif "tokens" in column_names:
        text_column_name = "tokens"
    else:
        text_column_name = column_names[0]

    if data_args.label_column_name is not None:
        label_column_name = data_args.label_column_name
    elif f"{data_args.task_name}_tags" in column_names:
        label_column_name = f"{data_args.task_name}_tags"
    else:
        label_column_name = column_names[1]

In detail, you can define text_column_name and label_column_name as hyperparameters to specify which column/key holds your text/tokens and which holds your labels. If you don't define them, the script looks for a tokens column and a {task_name}_tags column, and otherwise falls back to index 0 for the text/tokens and index 1 for the labels.
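As a rough illustration, the hyperparameters for such a training job could look like the dict below. SageMaker passes each entry to the entry-point script as a `--key value` command-line argument, which is how run_ner.py would receive them. The column names `words` and `labels`, the model name, and the channel names in the file paths are placeholders, not values from this thread.

```python
# Hypothetical hyperparameters for a run_ner.py training job.
# All values are placeholders; adjust them to your own dataset and channels.
hyperparameters = {
    "model_name_or_path": "distilbert-base-uncased",
    # SageMaker mounts each input channel under /opt/ml/input/data/<channel_name>;
    # the channel names "train" and "validation" are assumptions.
    "train_file": "/opt/ml/input/data/train/train.json",
    "validation_file": "/opt/ml/input/data/validation/validation.json",
    # Only needed when your columns are not named "tokens" / "ner_tags".
    "text_column_name": "words",
    "label_column_name": "labels",
    "output_dir": "/opt/ml/model",
    "do_train": True,
}


def to_cli_args(params):
    """Flatten a hyperparameter dict into a --key value argument list,
    roughly the form in which the training script receives it."""
    args = []
    for key, value in params.items():
        args.extend([f"--{key}", str(value)])
    return args


print(" ".join(to_cli_args(hyperparameters)))
```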

You can provide your dataset in any data file format that is compatible with the datasets library, e.g. csv or json. More on this here: Loading a Dataset — datasets 1.11.0 documentation
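For example, a JSON Lines file where each line is one sentence, already split into words, with one label per word. This is only a sketch: the column names `tokens` and `ner_tags` are chosen so that the script's defaults pick them up (assuming the default task name `ner`), and the sentences, string labels, and file name are made up.

```python
import json
import os
import tempfile

# Made-up examples: one dict per sentence, one label per word.
examples = [
    {"tokens": ["Jorge", "works", "in", "Madrid"],
     "ner_tags": ["B-PER", "O", "O", "B-LOC"]},
    {"tokens": ["Hugging", "Face", "is", "hiring"],
     "ner_tags": ["B-ORG", "I-ORG", "O", "O"]},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, checking tokens and labels line up."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            if len(row["tokens"]) != len(row["ner_tags"]):
                raise ValueError("each token needs exactly one label")
            f.write(json.dumps(row) + "\n")


# Hypothetical output location; upload this file to S3 for training.
train_path = os.path.join(tempfile.gettempdir(), "train.json")
write_jsonl(train_path, examples)
```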

Thank you again Phillip.

Good afternoon @philschmid,

I noticed that the train.py script in the notebook for multi-class text classification on SageMaker differs from run_glue.py: tokenization is done in the notebook (so the data is processed on the notebook instance's CPU and the tokenized data is sent to S3), not in train.py.

The reason is clear (why use the GPU of the SageMaker training instance for tokenization when a CPU is available on the notebook instance at a lower cost?), but it means we need to change this for all tasks.

In particular, I’m interested in the NER task.

  • Did anyone already publish a HF SageMaker notebook for NER with the associated script?
  • In general, does Hugging Face plan to publish HF SageMaker notebooks for all these NLP tasks or not?

Hey @pierreguillou,

The examples/ scripts are created to provide “examples” for NLP, speech & vision tasks that work on most available setups, meaning SageMaker, Colab, local, VMs, etc. That’s why the script contains the e2e example for pre-processing, training, evaluation, and testing.
I agree with you that moving the CPU tasks (processing) to a CPU makes more sense, but this doesn’t mean you cannot use run_ner.py to get started. The benefit of the examples/ scripts is that we make sure they stay compatible and work with datasets from the datasets library, which can save a lot of time for the community to not rewrite scripts for tasks/benchmarks.
In addition, the processing part for NER for a decently sized dataset (< 50k data points) should take a maximum of a few minutes, which means a few cents.

To answer your two questions:

Did anyone already publish a HF SageMaker notebook for NER with the associated script?

Not that I am aware of, but you could easily adjust run_ner.py, or you could adjust train.py to make it compatible with NER. If you are not sure how to do this, you can take a look at our course: Token classification - Hugging Face NLP Course
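For anyone adapting train.py, the main NER-specific preprocessing step is aligning word-level labels with sub-word tokens. Below is a minimal sketch of that step, not the exact code from run_ner.py or the course. It assumes `word_ids` is the list a fast tokenizer returns from `word_ids()` (None for special tokens, otherwise the index of the source word) and uses -100 as the label the loss function ignores.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def align_labels_with_tokens(word_labels, word_ids):
    """Give the first sub-token of each word that word's label and mask
    special tokens and remaining sub-tokens with IGNORE_INDEX."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP].
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # First sub-token of a new word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens of the same word are masked out.
            aligned.append(IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Hypothetical example: four words, the last one split into two sub-tokens.
word_ids = [None, 0, 1, 2, 3, 3, None]
word_labels = [1, 0, 0, 5]  # made-up label ids, one per word
print(align_labels_with_tokens(word_labels, word_ids))
# [-100, 1, 0, 0, 5, -100, -100]
```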

In general, does Hugging Face plan to publish HF SageMaker notebooks for all these NLP tasks or not?

In short, no, since we have the examples/ scripts, which are supported.

Thanks @philschmid.

I did succeed in using run_ner.py in an AWS SageMaker notebook instance :slight_smile:

Now, how to use predictor.predict() with arguments I can use in pipeline()? For example, I would like to use grouped_entities=True. In pipeline(), I do:

nlp = pipeline("ner", grouped_entities=True)

In predictor.predict()?

This is documented here: Reference
You can provide the pipeline’s kwargs under the parameters key when sending the request.

So @pierreguillou, how did you manage to get this done, with something like:

data = {
    "inputs": "...",
    "parameters": {
        # "aggregation_strategy": "SIMPLE"
        "grouped_entities": True
    }
}

# request
predictor.predict(data)

?
This made no difference in the grouping unfortunately…

Also @philschmid, there seems to be a typo in the docs:

{
  "inputs": "Hugging Face, the winner of VentureBeat’s Innovation in Natural Language Process/Understanding Award for 2021, is looking to level the playing field. The team, launched by Clément Delangue and Julien Chaumond in 2016, was recognized for its work in democratizing NLP, the global market value for which is expected to hit $35.1 billion by 2026. This week, Google’s former head of Ethical AI Margaret Mitchell joined the team.",
  "paramters": {
    "repetition_penalty": 4.0,
    "length_penalty": 1.5
  }
}

Note paramters instead of parameters.

@philschmid seems like this doesn’t really work for aggregation_strategy on token-classification tasks.

Also, the latest sagemaker image is transformers==4.6 (doesn’t include aggregation_strategy)

@Ilias you can find an overview of the available DLCs + versions including transformers_version in the documentation here: Reference

Wow that’s new! Thank you :ok_hand:

I customized the model_fn function in inference.py with grouped_entities to make it work
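One way such a customization might look, as a sketch rather than the exact code used here: an inference.py that overrides model_fn and builds the pipeline with grouped_entities=True. It assumes the SageMaker Hugging Face inference toolkit calls model_fn(model_dir) and passes the returned pipeline to its default prediction step, and that the container's transformers version still accepts grouped_entities.

```python
# inference.py (sketch): placed in the code/ directory of the model archive.


def model_fn(model_dir):
    """Load the trained model as a token-classification pipeline that
    merges sub-word predictions into whole entities."""
    # Imported inside the function so this module can be loaded without
    # transformers present; the inference container provides the library.
    from transformers import pipeline

    return pipeline(
        "ner",
        model=model_dir,
        tokenizer=model_dir,
        grouped_entities=True,  # assumption: supported by the installed version
    )
```

With this in place, the grouping is fixed at load time, so the request only needs the "inputs" key.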