I’m building a model for an NER task and use the tokenizer like this:
from transformers import AutoTokenizer

model_name = "bert-large-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The words are already split for NER, so pass is_split_into_words=True
sequence = ['Ash', 'punches', 'three', 'buttons', '.', 'An', 'X-ray', 'image', 'appears', '.']
inputs = tokenizer(sequence, is_split_into_words=True, return_tensors="pt")
tokens = inputs.tokens()
print(tokens)
---
#['[CLS]', 'Ash', 'punches', 'three', 'buttons', '.', 'An', 'X', '-', 'ray', 'image', 'appears', '.', '[SEP]']
As you can see, the tokenizer still splits the word "X-ray" even though I feed it an already-split sequence.
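If I understand correctly, this happens because the fast tokenizer's pre-tokenizer splits each word on punctuation before WordPiece runs, and is_split_into_words=True only skips the whitespace splitting. I tried to confirm that by calling the pre-tokenizer directly (output is from my run):

# Fast tokenizers expose the underlying pipeline via backend_tokenizer;
# this shows the punctuation split happening before WordPiece sees the word.
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("X-ray"))
---
#[('X', (0, 1)), ('-', (1, 2)), ('ray', (2, 5))]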
How can I keep it together, or better yet, have it split as ['X', '##-', '##ray'] instead?
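For what it's worth, I know I can map the sub-tokens back to my original words with word_ids() (available on fast tokenizers), so aligning NER labels is doable; I'd still like to control the split itself, though:

# Each entry is the index of the original word a sub-token came from;
# None marks the special tokens [CLS] and [SEP]. Note 'X', '-', 'ray'
# all map to word 6.
print(inputs.word_ids())
---
#[None, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, None]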