I understand that the NER task in transformers involves tokenization, embedding, and token-level classification, where tokens are either whole words or subwords. Is it possible to adapt this task to chunk/sentence embedding and classification?
I suppose there would need to be a tokenizer that produces chunks/sentences for embedding rather than words/subwords?
Is this the answer? Summary of the tokenizers — transformers 4.3.0 documentation
Please do not write multiple posts so close to each other. Instead simply edit the original post.
It is more general than you seem to imply. NER (or any classification task) is such a task only because the last layer(s) are specific to it. Everything that comes before (the embedding layer and all following layers) is structurally identical across tasks. So for token classification, the last layer(s) (also called the “head”) form a token classification head, which ensures that the model outputs logits for each token. In your case, you want a sentence classification head, which outputs only one logit/score per sequence (rather than per token).
So in your case you can use XXXForSequenceClassification, e.g. BertForSequenceClassification, which expects one label per input sequence.
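To make the difference between the two heads concrete, here is a minimal sketch. It uses a tiny randomly initialized `BertConfig` so it runs without downloading anything; in practice you would call `BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=...)` (the model name is just an example) to get pretrained weights.

```python
import torch
from transformers import (
    BertConfig,
    BertForSequenceClassification,
    BertForTokenClassification,
)

# Tiny config with random weights, just to inspect output shapes.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,  # e.g. three target classes
)

input_ids = torch.randint(0, 100, (1, 10))  # batch of 1, sequence length 10

# Token classification head: one set of logits per token.
tok_model = BertForTokenClassification(config)
tok_logits = tok_model(input_ids).logits
print(tok_logits.shape)  # (batch, seq_len, num_labels) -> one prediction per token

# Sequence classification head: one set of logits per sequence.
seq_model = BertForSequenceClassification(config)
seq_logits = seq_model(input_ids).logits
print(seq_logits.shape)  # (batch, num_labels) -> one prediction per sequence
```

The body of the model is identical in both cases; only the head (and therefore the shape of the logits) differs.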
The tokenization depends on the model that you use. So for BERT, use BertTokenizer, and so on. Note that depending on the tokenizer, you may want to pre-tokenize the text for the best results: the BertTokenizer will tokenize the input into subword tokens, but it may be useful to do word-splitting beforehand with a tool like spaCy or stanza. As an example: “Grandma’s cookies” is probably better tokenized as “Grandma 's cookies”. (Notice the space.) For tokenizers based on SentencePiece this is not needed.
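A small sketch of both routes: letting BertTokenizer handle the raw string, and passing already-split words (e.g. from spaCy) via `is_split_into_words=True`. To keep it runnable offline, it builds a tiny hand-made WordPiece vocab; with a real model you would use `BertTokenizer.from_pretrained("bert-base-uncased")` instead.

```python
import os
import tempfile
from transformers import BertTokenizer

# Hand-built toy vocab so no download is needed (illustration only).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "grandma", "'", "s", "cookies"]
tmp = tempfile.mkdtemp()
vocab_file = os.path.join(tmp, "vocab.txt")
with open(vocab_file, "w") as f:
    f.write("\n".join(vocab))

tokenizer = BertTokenizer(vocab_file, do_lower_case=True)

# Route 1: feed the raw string; BERT's basic tokenizer splits punctuation itself.
print(tokenizer.tokenize("Grandma's cookies"))

# Route 2: pre-split the text yourself (spaCy/stanza would produce such a list)
# and tell the tokenizer the input is already word-tokenized.
words = ["Grandma", "'s", "cookies"]
enc = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```

For sequence classification the choice mostly affects tokenization quality, not alignment; for token classification the pre-split route also makes it easy to map subword predictions back to words.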
Thank you very much for your advice. For sentence NER in a specific domain (my case is clinical diagnosis), do you believe that BERT is the model to try, or would you suggest something else? There seem to be quite a few options to choose from.