How would you approach designing a model for this problem?

Hi. I’m a hobbyist who’s been playing around with the Hugging Face libraries for the last year or so. I’m still very much a beginner and haven’t taken a CS course since about 2002 or a math class in even longer, so my coding is kind of rusty. This question isn’t specifically about the libraries, but more about my overall approach to solving my problem. I’d appreciate any tips or pointers, especially if the process I outline below has any glaring flaws.

In short, I’m trying to build a seq2seq model that translates English words into a phonetic representation, which will eventually be used for generative text projects that need to learn rhyme and rhythm.

I don’t think there are any existing language models that work in that space, so I’m rolling my own. There are plenty of general English datasets out there in addition to my task-specific data, so all I need to do is preprocess it through a phonetic model and I’ll have tons of data I can work with.

For most of this time, I’ve been using the Pincelate library, which is pretty great. I could preprocess my text by splitting out each word, feeding each word to Pincelate to get the phones, and then joining it all back together with the original whitespace and punctuation to produce the data for the next phase of my projects. But Pincelate was built with an older version of TensorFlow and doesn’t seem to be getting any updates. It has also been giving me less-than-ideal results with longer words and other edge cases, and it is relatively slow.

So I decided to try to make a phonetic model better suited to my projects using HF Transformers.

The training data I have is the same that Pincelate used – the CMUDict pronunciation dictionary of ~130K English words. I don’t know of any other phonetic datasets out there, especially none with full sentences, so I’m sticking with the word-by-word approach I was using before.

CMUDict has two columns of data, the English word and its phonetic representation in ARPAbet notation, with phonemes as one or two capital letters (e.g. AH, T, TH) and syllable stresses noted by digits 0-2.

I processed the data so the English words were all lower case, the phonemes were wrapped in square brackets ([AH], [T], [TH]) and the stress notations were [STR0], [STR1], [STR2]. I had also made a version with unary stress notation (each stress was indicated by 1-3 repetitions of a single [STR] token), but abandoned that after some early training experiments were coming out weird. I might revisit the unary stress approach again, especially now that I realize that 1 indicates the most stress, followed by 2 and then 0. But for now I’m still using 3 separate stress tokens.
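A minimal sketch of that conversion, for anyone who wants to picture the format. The function name, the space-separated output, and placing the stress token directly after its vowel are assumptions made for illustration, not necessarily the exact layout used in the real data.

```python
import re

# One ARPAbet symbol: capital letters, optionally followed by a stress digit.
SYMBOL_RE = re.compile(r"([A-Z]+)([012])?")

def arpabet_to_tokens(pron: str) -> str:
    """Turn a CMUDict pronunciation such as 'HH AH0 L OW1' into bracketed tokens."""
    out = []
    for symbol in pron.split():
        match = SYMBOL_RE.fullmatch(symbol)
        if match is None:
            raise ValueError(f"unexpected ARPAbet symbol: {symbol!r}")
        phoneme, stress = match.groups()
        out.append(f"[{phoneme}]")
        if stress is not None:
            # Assumption: the stress token directly follows the vowel it belongs to.
            out.append(f"[STR{stress}]")
    return " ".join(out)

def convert_entry(word: str, pron: str) -> tuple[str, str]:
    """Lower-case the word and convert its pronunciation."""
    return word.lower(), arpabet_to_tokens(pron)

print(convert_entry("HELLO", "HH AH0 L OW1"))
```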

Sidenote: each of the phonemes has 3-6 out of 30-some phonetic properties – e.g. [TH] is the “voiceless dental fricative”. I tried a few versions translating English words to these raw properties instead of the phonemes, but I abandoned it since it used 4-5x as many tokens per translated word, which would get crazy in a long document. I assume there must be a way to design a model that incorporates those properties efficiently, but I don’t know how.

Later phases of the project use the processed phonetic data to train BART- and GPT2-like language models (or, hopefully, a single BART-like model whose decoder I can use for causal language tasks). So I figured I may as well use a BART-like model and tokenizer for my phonemizer, so they could share a token vocabulary and maybe avoid a tokenizer decode/encode step.

I created a ByteLevelBPETokenizer, to which I manually added the phonemes and stresses as special tokens so they’d never be broken apart. I also added two tokens that I’d use to specify the tasks for my model: [ENCODE:] and [DECODE:].

I trained a few different versions of the tokenizer on the CMUDict data. I gradually increased the min_frequency threshold, figuring more granularity would make my model better at figuring out words not in the training data. I ended up moving forward with a tokenizer that had no merges at all, just the BART special tokens, my phonemes, and the base BPE tokens.

This also allows me to easily reverse the order of the phonemes at the token level, which is something that’ll come in handy in some later applications. So hopefully it’s not otherwise a dumb idea.
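As a rough illustration of the reversal, here is a string-level sketch. It assumes space-separated bracketed tokens and that each stress token should stay attached to the vowel before it; both are assumptions, and the function name is made up. With one token per phoneme, the same grouping idea would apply to token IDs.

```python
import re

# Matches bracketed tokens such as [TH] or [STR1].
TOKEN_RE = re.compile(r"\[[A-Z0-9]+\]")

def reverse_phonemes(encoded: str) -> str:
    """Reverse phoneme order, keeping each stress token next to its vowel."""
    units = []
    for token in TOKEN_RE.findall(encoded):
        if token.startswith("[STR") and units:
            # Assumption: a stress token belongs to the phoneme just before it.
            units[-1].append(token)
        else:
            units.append([token])
    units.reverse()
    return " ".join(token for unit in units for token in unit)

print(reverse_phonemes("[HH] [AH] [STR0] [L] [OW] [STR1]"))
```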

The model design is where I’ve done the most experimenting, and while I’m figuring out more each time, I don’t think I have the optimal design yet. I started from the standard BartConfig, and then made adjustments.

Since the longest English words are 30-40 characters long, and their phonetic representation could be maybe 30% longer, I set max_position_embeddings and d_model to a reasonable ceiling of 64.

That made the model much smaller than the default BART, so for a while I went wild adding more layers and attention heads. But I didn’t notice any major differences other than slower performance, so I ended up going back to the default BartConfig values for all of those properties. That’s where I am for the moment.

Over the last month I’ve trained over a dozen different versions of my phonetic translator. My training script is derived from the example that’s now called I’m using deepspeed because I’m lucky to have access to a couple 3090s from my day job as an animator.

In all versions of the training, I’m training the model on encoding and decoding tasks simultaneously. I load the CMUDict data from a CSV twice, swapping the columns and prepending the appropriate [ENCODE:] and [DECODE:] task tokens, and shuffling it all up.
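The data loading described above can be sketched roughly like this. The column order, the "source"/"target" field names, the space after the task token, and the choice that [ENCODE:] means English to phonemes are all assumptions for illustration.

```python
import csv
import io
import random

def load_rows(csv_file):
    """Read (word, phones) pairs from a two-column CSV file object."""
    return [(row[0], row[1]) for row in csv.reader(csv_file) if len(row) >= 2]

def build_examples(rows, seed=0):
    """Emit every pair twice, once per task direction, then shuffle."""
    examples = []
    for word, phones in rows:
        # Assumption: [ENCODE:] maps English to phonemes, [DECODE:] the reverse.
        examples.append({"source": f"[ENCODE:] {word}", "target": phones})
        examples.append({"source": f"[DECODE:] {phones}", "target": word})
    random.Random(seed).shuffle(examples)
    return examples

# Tiny in-memory stand-in for the real CMUDict CSV.
sample_csv = io.StringIO(
    "hello,[HH] [AH] [STR0] [L] [OW] [STR1]\n"
    "cat,[K] [AE] [STR1] [T]\n"
)
examples = build_examples(load_rows(sample_csv))
for example in examples:
    print(example)
```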

At the current model size and using fp16, I can fit a per-device batch of 512. At the default learning rate of 5e-5, I was getting what I thought were pretty decent results after about 1500 epochs, with minimal improvement by going longer.

But then I discovered the model was really struggling on some out-of-training words from a medical textbook, like “fluoroaortogram” and “osmoreceptorologist”. It wasn’t just getting some phonemes wrong; it was producing a drastically wrong number of syllables.
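One way to catch this kind of failure automatically is to compare syllable counts on held-out dictionary entries. Since every vowel in CMUDict carries a stress digit, counting stress tokens gives the syllable count. This is only a sketch: the function names are made up, and the bad "cat" prediction below is invented for the demo, not real model output.

```python
import re

# Each vowel carries exactly one stress token, so this counts syllables.
STRESS_RE = re.compile(r"\[STR[012]\]")

def syllable_count(encoded: str) -> int:
    return len(STRESS_RE.findall(encoded))

def syllable_mismatches(rows):
    """rows: iterable of (word, reference, prediction) strings.

    Returns (word, expected_count, predicted_count) for every disagreement.
    """
    mismatches = []
    for word, reference, prediction in rows:
        expected = syllable_count(reference)
        got = syllable_count(prediction)
        if expected != got:
            mismatches.append((word, expected, got))
    return mismatches

rows = [
    ("hello", "[HH] [AH] [STR0] [L] [OW] [STR1]", "[HH] [AH] [STR0] [L] [OW] [STR1]"),
    # Invented bad prediction with one extra syllable.
    ("cat", "[K] [AE] [STR1] [T]", "[K] [AE] [STR1] [T] [AH] [STR0]"),
]
print(syllable_mismatches(rows))
```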

So I started playing around a lot with learning rate, number of epochs, and even tweaking the dropout in the model. Every time I trained the model even slower or increased the dropout, as long as I kept training for longer I’d get better results.

Right now I’m training the 13th major iteration, where I’ve dropped the learning rate to 1e-5 and increased the dropout to 0.4. At the moment it’s 90% of the way through 4500 epochs, and the intermediate results are looking better. Though I might have to go past epoch 4500.

Intended Usage
Since any model I train will be slower and no more accurate than directly looking up the CMUDict entries for its 130K common words, my phonemizing program breaks a document into individual words, looks up the pronunciation for each word in the CMUDict, and only uses the model to batch process each unique word it doesn’t already know. Then it saves all of the new pronunciations it learns, which comes in handy when there are a lot of documents with similar out-of-dict vocabularies.
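The lookup-then-fallback flow looks roughly like the sketch below. The word regex, the in-memory dict used as the lexicon, and the predict_batch callable standing in for the model are all assumptions for illustration; the real program also persists what it learns to disk.

```python
import re

# Assumption: a "word" is a run of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")

def phonemize(text, lexicon, predict_batch):
    """Replace each word with its pronunciation, keeping whitespace and punctuation.

    lexicon: dict mapping lower-case words to pronunciations (updated in place).
    predict_batch: callable taking a list of words and returning a list of
    pronunciations in the same order (a stand-in for the model).
    """
    words = {match.group(0).lower() for match in WORD_RE.finditer(text)}
    unknown = [word for word in sorted(words) if word not in lexicon]
    if unknown:
        # Only unique out-of-dictionary words reach the model, in one batch.
        lexicon.update(zip(unknown, predict_batch(unknown)))
    return WORD_RE.sub(lambda match: lexicon[match.group(0).lower()], text)

calls = []

def fake_model(words):
    """Placeholder model: records the batch and returns dummy pronunciations."""
    calls.append(list(words))
    return [f"<{word}>" for word in words]

lexicon = {"the": "[DH] [AH] [STR0]"}
print(phonemize("The cat, the CAT!", lexicon, fake_model))
```

Calling phonemize again with the same lexicon skips the model entirely for words it has already seen.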

So, what can I do better?
First, if you’ve gotten all the way to the end of this novel I’ve just written, thank you.

Does anything in the approach I’ve outlined jump out as being clearly wrong? I know I’m making a lot of guesses and uninformed assumptions, so if there’s anything obviously wrong in my thought process, I’d love to learn about it. Or if there’s already any established work being done in this area that I haven’t found, I’d love a pointer.

