Hi there!
I have a question about the tokenizers used by transformer models. To keep things simple, I will restrict the discussion to BERT.
BERT uses the WordPiece tokenizer, so its features are word pieces plus some whole words. The advantages of this approach are a small vocabulary, robustness to new words and computational efficiency.
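To show what I mean by word pieces, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word. The vocabulary is a toy one I made up for illustration (BERT's real vocabulary has about 30,000 entries), and the function name is hypothetical, not part of any library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"token", "##izer", "##s", "play", "##ing"}

print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
```

A word that is not in the vocabulary is therefore still covered by its pieces, which is where the robustness to new words comes from.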
I’m working on a text-classification project, and an important module of this project is about XAI: I was required to use SHAP to explain BERT’s classifications.
We discovered, without much surprise, that SHAP highlights word pieces to explain BERT’s classifications, but my boss is not happy with this.
He wants SHAP to highlight whole words rather than pieces of words, but with the WordPiece tokenizer this is impossible, because that algorithm works on split words!
Now he has asked me to train a new tokenizer that produces word-level features for BERT, but I am not very convinced by this strategy because:
- as far as I can tell, Hugging Face does not provide a word-level tokenizer to train from scratch;
- it would result in a huge vocabulary, which would have to be built from the domain documents (very few, about a hundred) plus a huge generic corpus like the ones used for BERT and RoBERTa, in order to cover as many words as possible and increase the resilience to new words;
- it is an expensive operation in terms of time, effort and computational resources: BERT would have to be retrained, because its knowledge is based on the features extracted by the WordPiece tokenizer, and to get things right I would then have to redo the hyperparameter search, which would be very challenging as I am doing everything on my own;
- I have not found any papers (IEEE, Elsevier journals) or anything on the web where someone trains a word-level tokenizer from scratch for BERT and similar models.
I am convinced that the best way forward is a post-processing strategy: retrieve the embeddings of the word pieces that a word was split into, combine them into an embedding for the reconstructed word, and then work out how SHAP can use that word-level representation.
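To make the idea concrete, here is a minimal sketch of one possible variant of this post-processing. Note that it does not merge embeddings: it merges the SHAP values of the word pieces back into word-level scores by summing them, which relies on SHAP values being additive. The function name, the tokens and the scores are all made up for illustration; real values would come from the SHAP explainer, and I am assuming the usual "##" continuation prefix of BERT's WordPiece vocabulary.

```python
def merge_wordpiece_attributions(tokens, values, prefix="##"):
    """Merge word pieces into whole words and sum their attribution scores.

    tokens: WordPiece tokens, continuation pieces marked with `prefix`
    values: one attribution score per token (hypothetical SHAP values)
    """
    words, scores = [], []
    for token, value in zip(tokens, values):
        if token.startswith(prefix) and words:
            # Continuation piece: glue it to the previous word, add its score.
            words[-1] += token[len(prefix):]
            scores[-1] += value
        else:
            # A new word starts here.
            words.append(token)
            scores.append(value)
    return list(zip(words, scores))


# Made-up example: "tokenizers" was split into three word pieces.
tokens = ["the", "token", "##izer", "##s", "work"]
values = [0.0, 0.25, 0.5, 0.125, -0.25]

merged = merge_wordpiece_attributions(tokens, values)
print(merged)  # [('the', 0.0), ('tokenizers', 0.875), ('work', -0.25)]
```

With something like this BERT and its tokenizer would stay untouched, and only the explanation shown to the user would be at word level. Whether a plain sum is the right aggregation for our use case is exactly what I am unsure about.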
Has anyone trained a word-level tokenizer from scratch?
Has anyone used a post-processing technique to retrieve the correct embedding and representation of a split word for SHAP?