The gpt2 tokenizer still contains extra tokens beyond those I wanted in the initial_alphabet, but the gpt2 model performs reasonably well at char-level.
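In case it helps as context, this is roughly the kind of setup I mean by character level: a tiny BPE vocabulary seeded with initial_alphabet, trained from scratch with the tokenizers library. The corpus and alphabet below are just illustrative placeholders, not my real data.

```python
from tokenizers import Tokenizer, models, trainers

# Toy corpus and alphabet, only to show the calls; the real data is the transaction strings.
corpus = ["tesco super", "paypal * tescosupermar", "zilch * tescosupermarket - 343"]
alphabet = list("abcdefghijklmnopqrstuvwxyz0123456789 *-")

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(
    vocab_size=len(alphabet) + 2,   # vocab barely bigger than the alphabet -> almost no merges, i.e. near char-level
    initial_alphabet=alphabet,
    special_tokens=["[UNK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)
print(tokenizer.encode("tesco super").tokens)
```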
I am working with bank transactions, and some examples of the merchant names are
"Tesco Super", "Tesco Superma", "Paypal * Tescosupermar", "Zilch * tescosupermarket - 343", which all have a very similar meaning.
I am currently using a sentence embedding based on the average of word embeddings (universal sentence embedding) to classify these transactions into supermarkets, hotels, restaurants… It works OK, but it could be better, so I want to create a sentence embedding based on the average of character embeddings instead.
Is there a language model (LLM) that was trained on character tokens and from which I can extract an embedding that combines the meaning of the characters? I know there are some character-level tokenizers, but which models are compatible with such a tokenizer…
I know there are CharBERT and CharacterBERT, but those still use a word tokenizer; it is just that the meaning of each word is built from characters… For me the space " " is just another character and does not always mean a new word… I want my embedding to treat the space as just another character with a meaning, and * is also a character with an important meaning.
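To make concrete what I mean by averaging character embeddings, here is a toy sketch. The embedding table is untrained and the dimensions are made up, so it only shows the shape of the idea, not a real model:

```python
import torch
import torch.nn as nn

class CharAverageEncoder(nn.Module):
    """Embed every character (space and '*' included) and mean-pool into one sentence vector."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        # One embedding per possible byte value; a real system would learn these weights.
        self.char_embeddings = nn.Embedding(num_embeddings=256, embedding_dim=embed_dim)

    def forward(self, text: str) -> torch.Tensor:
        char_ids = torch.tensor(list(text.encode("utf-8")))
        return self.char_embeddings(char_ids).mean(dim=0)  # shape: (embed_dim,)

encoder = CharAverageEncoder()
for name in ["Tesco Super", "Paypal * Tescosupermar"]:
    print(name, encoder(name).shape)
```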
Randomly stumbled on this as I'm looking for a character-level encoding model myself. However, for your case, I'd suggest trying a model from the MTEB leaderboard that does well with classification and is smaller (and hopefully faster). See if you actually get embeddings that are dissimilar for your examples. If you do, you can fine-tune your model. Feed it triples of (query, positive_example, random_negative_example). The query and positive examples would be like the ones you have. Do some googling on the triplet loss function with PyTorch for details. Good luck!
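To give a flavour of what that fine-tuning looks like, here is a rough sketch using the sentence-transformers wrapper around a triplet loss. The base model name, the negative examples, and the hyperparameters are placeholders, not a recommendation; swap in whichever leaderboard model and real transaction triples you end up with:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base model; pick a small classifier-friendly model from the MTEB leaderboard instead.
model = SentenceTransformer("all-MiniLM-L6-v2")

# (query, positive_example, random_negative_example) triples, as described above.
# The negatives here are invented names from other categories (hotels, restaurants).
train_examples = [
    InputExample(texts=["Tesco Superma", "Paypal * Tescosupermar", "Premier Inn 1042"]),
    InputExample(texts=["Zilch * tescosupermarket - 343", "Tesco Super", "Pizza Express Leeds"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.TripletLoss(model=model)  # pulls anchor/positive together, pushes the negative away

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("merchant-embedder")
```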