Hello @philschmid I hope you are doing well. Question for you: do you have an example of the expected format of the data in order to be able to use this script (run_ner.py) in SageMaker training?
Thanks,
Jorge
Hey,
you can find the data format of all examples/ scripts inside the script itself. For run_ner.py it is here: transformers/run_ner.py at b518aaf193938247f698a7c4522afe42b025225a · huggingface/transformers · GitHub
if data_args.text_column_name is not None:
    text_column_name = data_args.text_column_name
elif "tokens" in column_names:
    text_column_name = "tokens"
else:
    text_column_name = column_names[0]

if data_args.label_column_name is not None:
    label_column_name = data_args.label_column_name
elif f"{data_args.task_name}_tags" in column_names:
    label_column_name = f"{data_args.task_name}_tags"
else:
    label_column_name = column_names[1]
In detail, you can either define text_column_name & label_column_name as hyperparameters to define which column/key of your dataset holds the text/token and label fields. If you don't define them, the script picks index 0 for the text/tokens and index 1 for the label.
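As an illustration, here is a minimal sketch of passing these as hyperparameters to a SageMaker training job (the model name, paths, versions, role, and S3 URI below are assumptions for the example, not values from this thread):

from sagemaker.huggingface import HuggingFace

# hypothetical hyperparameters: the column names must match your dataset's keys
hyperparameters = {
    "model_name_or_path": "bert-base-cased",
    "text_column_name": "tokens",     # key that holds the token lists
    "label_column_name": "ner_tags",  # key that holds the NER labels
    "train_file": "/opt/ml/input/data/train/train.json",
    "output_dir": "/opt/ml/model",
    "do_train": True,
}

huggingface_estimator = HuggingFace(
    entry_point="run_ner.py",
    source_dir="./examples/pytorch/token-classification",  # assumed local checkout
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters=hyperparameters,
)

huggingface_estimator.fit({"train": "s3://my-bucket/ner/train"})  # assumed S3 URI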
You can provide your dataset in any data file format that is compatible with the datasets library, e.g. csv or json. More on this here: Loading a Dataset — datasets 1.11.0 documentation
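For example, with the defaults above, a JSON-lines train.json could look like this (the sentences and tags are made up for illustration):

{"tokens": ["Hugging", "Face", "is", "based", "in", "New", "York"], "ner_tags": ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC"]}
{"tokens": ["Jorge", "works", "at", "AWS"], "ner_tags": ["B-PER", "O", "O", "B-ORG"]}

Since the columns are named tokens and ner_tags, they match the defaults the script looks for, so no extra column hyperparameters are needed.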
Thank you again Philipp.
Good afternoon @philschmid,
I noticed that the script train.py of the notebook for multi-class text classification on SageMaker is different from the script run_glue.py: the tokenization is done in the notebook (so the data is processed on the notebook instance's CPU and the tokenized data is sent to S3), not in the script train.py.
The reason is clear (why use the GPU of the SageMaker training instance for tokenization when the CPU of the notebook instance is available at a lower cost), but it means we need to change this for all tasks.
In particular, I'm interested in the NER task.
Hey @pierreguillou,
The examples/ scripts are created to be “examples” for NLP, speech & vision tasks that work on most available setups, meaning SageMaker, Colab, local machines, VMs, etc. That's why each script contains the end-to-end example for pre-processing, training, evaluation, and testing.
I agree with you that moving the CPU tasks (processing) to a CPU makes more sense, but this doesn't mean you cannot use run_ner.py to get started. The benefit of the examples/ scripts is that we make sure they stay compatible and work with datasets from the datasets library, which can save the community a lot of time, since scripts don't have to be rewritten for each task/benchmark.
In addition, the processing part for NER on a decently sized dataset (< 50k data points) should take a maximum of a few minutes, which means a few cents.
To answer your two questions:
Did anyone already publish a HF SageMaker notebook for NER with the associated script?
Not that I am aware of, but you could easily adjust run_ner.py, or you could adjust train.py to make it compatible for NER. If you are not sure how to do this, you can take a look at our course: Token classification - Hugging Face NLP Course
In general, does Hugging Face plan to publish all these HF SageMaker notebooks for NLP tasks or not?
In short, no, since we have the examples/ scripts, which are supported.
Thanks @philschmid.
I did succeed in using run_ner.py in an AWS SageMaker notebook instance.
Now, how do I use predictor.predict() with the arguments I can use in pipeline()? For example, I would like to use grouped_entities=True. In pipeline(), I do:
nlp = pipeline("ner", grouped_entities=True)
How do I do this in predictor.predict()?
This is documented here: Reference
You can provide the pipeline’s kwargs in the parameters key when sending the request.
So @pierreguillou, how did you manage to get this done? Something like:

data = {
    "inputs": "...",
    "parameters": {
        # "aggregation_strategy": "SIMPLE"
        "grouped_entities": True
    }
}

# request
predictor.predict(data)

?
This made no difference in the grouping unfortunately…
Also @philschmid, there seems to be a typo in the docs:
{
    "inputs": "Hugging Face, the winner of VentureBeat’s Innovation in Natural Language Process/Understanding Award for 2021, is looking to level the playing field. The team, launched by Clément Delangue and Julien Chaumond in 2016, was recognized for its work in democratizing NLP, the global market value for which is expected to hit $35.1 billion by 2026. This week, Google’s former head of Ethical AI Margaret Mitchell joined the team.",
    "paramters": {
        "repetition_penalty": 4.0,
        "length_penalty": 1.5
    }
}
Note paramters instead of parameters.
@philschmid it seems like this doesn't really work for aggregation_strategy on token-classification tasks.
Also, the latest SageMaker image is transformers==4.6, which doesn't include aggregation_strategy.
@Ilias you can find an overview of the available DLCs and versions, including transformers_version, in the documentation here: Reference
Wow that’s new! Thank you
I customized the model_fn function in inference.py with grouped_entities to make it work.
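For reference, a minimal sketch of what such a customization could look like (an assumption based on the thread, not the poster's exact code; the pipeline task and loading from model_dir are illustrative):

# inference.py, shipped in the code/ directory of the model archive
from transformers import pipeline

def model_fn(model_dir):
    # Load the fine-tuned model and tokenizer from the SageMaker model
    # directory and bake grouped_entities=True into the pipeline, so every
    # request returns grouped entities without needing a "parameters" key.
    return pipeline(
        "ner",
        model=model_dir,
        tokenizer=model_dir,
        grouped_entities=True,  # pre-4.7 equivalent of aggregation_strategy="simple"
    )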