Recently, I have been using bert-base-multilingual-cased for my work in Bengali. When I feed a sentence like "āĻāĻāĻā§ হবে না" into BertTokenizer, I get the following output.
Sentence: āĻāĻāĻā§ হবে না
Tokens: ['āĻ', '##āĻ', '##āĻā§', 'হবে', 'না']
To Int: [938, 24383, 18243, 53761, 26109]
But when I feed in a sentence like "āĻāĻāĻā§ হবেনা", with "না" not separated by a space, the token becomes "##না" and its corresponding index changes as well.
Sentence: āĻāĻāĻā§ হবেনা
Tokens: ['āĻ', '##āĻ', '##āĻā§', 'হবে', '##না']
To Int: [938, 24383, 18243, 53761, 20979]
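For reference, output like the above can be produced with something along these lines; this is only a minimal sketch, and only the last two words of the example sentence are shown here.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# spaced vs. unspaced variant of the end of the sentence
for sentence in ["হবে না", "হবেনা"]:
    tokens = tokenizer.tokenize(sentence)
    print(tokens, tokenizer.convert_tokens_to_ids(tokens))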
Now, I was hoping there is some way to let the tokenizer know that whenever it finds something like "##না", it should convert it to "না", in all such cases.
It might be easier to replace the না in the sentence with a space followed by না before you tokenize.
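For instance, a rough sketch of such a preprocessing step (the regex and helper name are only illustrative, and this naive rule would also split longer words that merely end in না, so it may need tightening):

import re

def add_space_before_na(text):
    # insert a space before a word-final "না" that is glued to the previous word,
    # e.g. "হবেনা" -> "হবে না"; note that words which merely end in "না"
    # would get split as well
    return re.sub(r"(\S)(না)\b", r"\1 \2", text)

print(add_space_before_na("হবেনা"))  # "হবে না"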
Is it just the ##না that is a problem, or do you want to get rid of all the ## continuation tokens?
Recently I realized it is not the case for every না.
And no, not for all ## continuation tokens, only for a few of them.
Can't you just replace the tokens before converting them to IDs?
# set of all tokens that should be replaced
NEED_REPL = {"##না"}

def retokenize(tokens):
    # strip the leading "##" from tokens that are in the replacement set
    return [t.replace("##", "", 1) if t in NEED_REPL else t for t in tokens]

tokens = ['āĻ', '##āĻ', '##āĻā§', 'হবে', '##না']
replaced = retokenize(tokens)
print(replaced)
# ['āĻ', '##āĻ', '##āĻā§', 'হবে', 'না']
Yes, but when I pass the whole dataset to the tokenizer, I have to do something like
encoding = self.tokenizer.encode_plus(
    reviews,
    add_special_tokens=True,
    max_length=self.max_len,
    return_token_type_ids=True,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',
)
where the encoding variable contains the input_ids and attention_mask for each sentence.
How could I override or work around the tokenizer function here?
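One possible workaround, as a rough sketch (the helper names below are mine, not part of the tokenizers API): keep the encode_plus call as it is and remap the offending IDs in the returned input_ids afterwards, looking the IDs up from the tokenizer instead of hard-coding them.

NEED_REPL = {"##না"}

def build_id_map(tokenizer, need_repl):
    # map e.g. id("##না") -> id("না"), skipping anything the vocab does not know
    id_map = {}
    for tok in need_repl:
        src = tokenizer.convert_tokens_to_ids(tok)
        dst = tokenizer.convert_tokens_to_ids(tok.replace("##", "", 1))
        if src != tokenizer.unk_token_id and dst != tokenizer.unk_token_id:
            id_map[src] = dst
    return id_map

def remap_input_ids(input_ids, id_map):
    # input_ids is the tensor returned by encode_plus(..., return_tensors='pt')
    for src, dst in id_map.items():
        input_ids[input_ids == src] = dst
    return input_ids

Then something like encoding['input_ids'] = remap_input_ids(encoding['input_ids'], build_id_map(self.tokenizer, NEED_REPL)) after the encode_plus call would leave the attention mask, padding and special tokens untouched.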