How to add the uppercase token (and this behaviour) to tokenizers?

jonathanalis · June 27, 2022, 3:34am

I have observed that tokenizers, in the majority of cases, first convert the text to lowercase.
As a result, when the tokenizer vocabulary is built, the uppercase and lowercase forms of a word map to the same token. That part is fine with me: I don't want to create a separate vocabulary entry for a term with the same meaning.

However, we lose the information carried by the casing. For instance, uppercase words appear more frequently in titles, so the mere fact that a word is in uppercase can tell the model that the word may be part of a title. Likewise, an uppercase first letter can indicate a proper name, a country, the beginning of a sentence, etc. So casing does carry valuable information.

So I think the best solution to this dilemma is to create one special token indicating that the next word is entirely in uppercase, and another indicating that the next word has its first letter in uppercase, and to embed this behavior in the tokenizer itself.
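A minimal sketch of this idea as a plain-Python preprocessing step, applied before the text reaches the tokenizer. The marker strings `<up>` and `<cap>` and the function names `encode_case` and `decode_case` are hypothetical and do not come from any library. The sketch is lossy for mixed-case words such as "iPhone" or "McDonald", which it treats as plain lowercase or first-letter-capitalized respectively.

```python
import re

# Hypothetical marker strings (assumption): any strings that never
# occur in the corpus would work equally well.
UP = "<up>"    # the next word is entirely uppercase
CAP = "<cap>"  # the next word starts with an uppercase letter

# A "word" here is a run of letters (no digits, no underscore).
_WORD = re.compile(r"[^\W\d_]+")


def encode_case(text: str) -> str:
    """Lowercase the text, inserting a marker before each cased word."""
    def mark(match: re.Match) -> str:
        word = match.group(0)
        if len(word) > 1 and word.isupper():
            return f"{UP} {word.lower()}"
        if word[0].isupper():
            # Also covers single-letter words like "I" and, lossily,
            # mixed-case words like "McDonald".
            return f"{CAP} {word.lower()}"
        return word.lower()

    return _WORD.sub(mark, text)


def decode_case(text: str) -> str:
    """Invert encode_case for words that were fully upper or capitalized."""
    text = re.sub(
        re.escape(UP) + r" ([^\W\d_]+)",
        lambda m: m.group(1).upper(),
        text,
    )
    text = re.sub(
        re.escape(CAP) + r" ([^\W\d_]+)",
        lambda m: m.group(1).capitalize(),
        text,
    )
    return text


encoded = encode_case("NASA launched Apollo from Florida")
print(encoded)
# <up> nasa launched <cap> apollo from <cap> florida
print(decode_case(encoded))
# NASA launched Apollo from Florida
```

For the markers to survive tokenization as single tokens, they would still need to be registered with the tokenizer, for example through the `user_defined_symbols` option when training a SentencePiece model, or with `add_tokens` on a transformers tokenizer. That wiring is not shown here.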

However, I have no idea how to do this (without writing the tokenizer functions from scratch).
Are there easy ways to do this with some of the tokenizer classes?
In particular, I want to work with SentencePiece tokenizers (instead of WordPiece).
The guides I have found do not go that deep into customizing the tokenizer's behavior.

Thanks in advance

jon-fernandes · July 2, 2022, 3:56pm

You could use a checkpoint like 'bert-base-cased' that preserves case.

from transformers import AutoTokenizer

checkpoint = 'bert-base-cased'  # cased checkpoint: the input is not lowercased
sentence = "Sir Tom Walters"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inp = tokenizer(sentence, return_tensors='pt')
print(inp.tokens())
# ['[CLS]', 'Sir', 'Tom', 'Walters', '[SEP]']
