Customization of Wav2Vec2CTCTokenizer with rules

spicci · August 22, 2022, 10:52am

Hi, my goal is to fine-tune an ASR model, WavLM, that relies on the pretrained tokenizer Wav2Vec2CTCTokenizer.

I want to fine-tune this ASR model on another language and have the tokenization follow phonological rules, such as syllable segmentation.

If I provide a vocabulary containing all the possible syllables (i.e. my tokens), is it possible to customize the Wav2Vec2CTCTokenizer segmentation so that it respects syllable segmentation rules?

Example:

Original sentence:
Il tentativo era cosi bello

Segmentation made by Wav2Vec2CTCTokenizer (not respecting syllabification rules):
['il', 'ten', 'tat', 'iv', 'o', 'Er', 'a', 'kos', 'i', 'bEl', 'lo']

Expected segmentation according to syllabification rules:
['il', 'ten', 'ta', 'ti', 'vo', 'E', 'ra', 'ko', 'si', 'bEl', 'lo']

Basically, I need to encode some rules in the tokenizer, for example to prefer tokens that put a consonant in the onset of a syllable rather than in the coda of the preceding one.
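
To make the rule concrete, here is a minimal sketch in plain Python of the onset-priority segmentation I have in mind. It is independent of Wav2Vec2CTCTokenizer and only illustrates the rule: the vowel inventory, the `COMPLEX_ONSETS` set, and the function names `syllabify` and `tokenize` are assumptions made up for this example, and adjacent vowels are simply split (no diphthong handling).

```python
# Assumed phoneme inventory for the example (SAMPA-like, E/O = open vowels).
VOWELS = set("aeiouEO")

# Hypothetical list of consonant clusters allowed as a syllable onset.
# A single consonant is always allowed as an onset.
COMPLEX_ONSETS = {"tr", "pr", "br", "kr", "gr", "dr", "fr",
                  "pl", "bl", "kl", "gl", "fl"}


def syllabify(word):
    """Split one phonemically transcribed word into syllables,
    giving intervocalic consonants to the following onset when possible."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return [word]  # no vowel: keep the word as a single token

    syllables = []
    start = 0
    for cur, nxt in zip(nuclei, nuclei[1:]):
        cluster = word[cur + 1:nxt]  # consonants between two nuclei
        onset_len = 0
        # Take the longest suffix of the cluster that is a legal onset.
        for k in range(len(cluster), 0, -1):
            if k == 1 or cluster[-k:] in COMPLEX_ONSETS:
                onset_len = k
                break
        boundary = nxt - onset_len
        syllables.append(word[start:boundary])
        start = boundary
    syllables.append(word[start:])  # last syllable keeps any final coda
    return syllables


def tokenize(sentence):
    """Syllabify every whitespace-separated word of the sentence."""
    return [syl for word in sentence.split() for syl in syllabify(word)]


print(tokenize("il tentativo Era kosi bEllo"))
```

With the transcription used in the example above, this prints the expected segmentation: `['il', 'ten', 'ta', 'ti', 'vo', 'E', 'ra', 'ko', 'si', 'bEl', 'lo']`.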

Is it possible to add these kinds of rules to the tokenizer?
If so, where can I modify the relevant parameters?

If not, and I train a new tokenizer instead, will it be OK to use it with the pretrained WavLM model that I need to fine-tune?

Thanks in advance!
