Model Performance and Sanity Check

I’m working on a small project to get my head around transformer models, and I’ve hit a bit of a roadblock. I was hoping someone could give me a sanity check on my steps and expectations. Currently I’m working on inference and optimization, and I’m not sure whether my performance numbers below are normal or can be improved.

Project Goal
I want to create a simple model that takes a lowercase address and outputs a correctly capitalized version. It should be trained from scratch (no pre-training) and ideally support multiple languages.

Data Prep
I have 1 million addresses, with 10% held out for eval. The output is the same as the lowercase input, but with special tokens such as [capitalize-first] or [capitalize-all] inserted. My thinking is that the model will learn more easily if the tokenized words are identical on both sides, so it only needs to learn where to insert the capitalization tokens. All inputs/outputs are padded to the length of the longest tokenized address.
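For concreteness, target construction looks roughly like this (simplified sketch; the helper and the exact token strings are just illustrative):

```python
# Minimal sketch: words stay lowercase, and a marker token is inserted in front
# of any word that needs capitalization in the original address.
CAP_FIRST = "[capitalize-first]"
CAP_ALL = "[capitalize-all]"

def build_target(original_address: str) -> str:
    pieces = []
    for word in original_address.split():
        if len(word) > 1 and word.isupper():
            pieces.append(CAP_ALL)          # whole word is upper-case
        elif word[:1].isupper():
            pieces.append(CAP_FIRST)        # only the first letter is upper-case
        pieces.append(word.lower())
    return " ".join(pieces)

# input:  "10 downing st london"   (lowercased source)
# target: build_target("10 Downing St LONDON")
#   -> "10 [capitalize-first] downing [capitalize-first] st [capitalize-all] london"
```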

Training
I’ve mostly stuck with the model’s config defaults and only tweak batch size, learning rate, and weight decay. I use the Seq2SeqTrainer with BF16 or TF32. I run it until the validation loss starts to plateau, typically 5-10 epochs.
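A stripped-down version of the training setup (the hyperparameter values shown are placeholders rather than my exact settings, and train_ds/eval_ds are the pre-tokenized splits):

```python
from transformers import (AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments)

base = "google/byt5-small"                         # or t5-small / facebook/bart-base, etc.
tokenizer = AutoTokenizer.from_pretrained(base)
config = AutoConfig.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_config(config)  # from scratch: architecture only, no pretrained weights

args = Seq2SeqTrainingArguments(
    output_dir="capitalizer",
    per_device_train_batch_size=64,                # placeholder values, not my exact settings
    learning_rate=1e-4,
    weight_decay=0.01,
    num_train_epochs=10,
    bf16=True,                                     # or tf32=True on hardware that supports it
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,                        # pre-tokenized datasets, not shown here
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```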

Optimization
I use ORTOptimizer at optimization level 1 (had trouble with the other levels), and then run ORTQuantizer on each of the exported ONNX files (encoder, decoder, decoder-with-past).
I also tried the OpenVINO optimizer plus 8-bit quantization, but the results were worse for inference speed.
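The export/optimize/quantize steps are essentially this (paths are placeholders, and the quantization config should be picked to match the target CPU):

```python
from pathlib import Path
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

# export the trained checkpoint to ONNX
ort_model = ORTModelForSeq2SeqLM.from_pretrained("capitalizer", export=True)
ort_model.save_pretrained("capitalizer-onnx")

# graph optimization at level 1
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(save_dir="capitalizer-onnx-opt",
                   optimization_config=OptimizationConfig(optimization_level=1))

# dynamic int8 quantization, one pass per exported ONNX file
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
for onnx_file in sorted(Path("capitalizer-onnx-opt").glob("*.onnx")):
    quantizer = ORTQuantizer.from_pretrained("capitalizer-onnx-opt", file_name=onnx_file.name)
    quantizer.quantize(save_dir="capitalizer-onnx-quant", quantization_config=qconfig)
```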

Prompting
I use ORTModelForSeq2SeqLM and have also tried BetterTransformer. Mostly I stay within the Transformers library with encode -> generate -> decode, and haven’t seen a significant difference between that and the pipeline() approach. I also could not get pipeline() to output the special tokens.
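The inference path is basically this (placeholder paths; skip_special_tokens=False is what keeps the capitalization markers visible in the output):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# assumes the tokenizer was saved alongside the quantized ONNX export
tokenizer = AutoTokenizer.from_pretrained("capitalizer-onnx-quant")
model = ORTModelForSeq2SeqLM.from_pretrained("capitalizer-onnx-quant")

inputs = tokenizer("10 downing st london", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)

# keep the [capitalize-*] markers in the decoded string
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```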

Models / Tokenizers
From what I understand, “ForConditionalGeneration” is the right model class for my goal. This has limited me to mostly testing the Pegasus, T5, and BART families of models. I’ve dabbled with a few others, but T5 and BART have been my primary models. The tokenizer matches the base model, with my special tokens added.
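Adding the special tokens is just this (the base model name here is only an example):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

base = "google/byt5-small"                       # example base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)  # or a from-scratch config, as above

# register the capitalization markers so they are never split during tokenization
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[capitalize-first]", "[capitalize-all]"]}
)
# grow the embedding matrix to cover the new token ids
model.resize_token_embeddings(len(tokenizer))
```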

My best results have been with ByT5. My thinking here is that because it has a tiny byte-level vocabulary, it breaks words into much smaller pieces and doesn’t have to learn capitalization rules for every individual word. After about 5 epochs, validation loss is below 0.002. BART also gets acceptable results, but its loss is significantly worse at 0.01.

Inference Performance
I’m trying to stick with CPU for inference so I can deploy easily without having to add GPUs to all the production servers. Testing was done on my 11th-gen i5, and the measurement is of the “model.generate” call alone.
ByT5 seems to win here as well, despite needing much longer token sequences.
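The measurement is essentially just this (simplified sketch; paths are placeholders and the warm-up call is illustrative):

```python
import time
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("capitalizer-onnx-quant")
model = ORTModelForSeq2SeqLM.from_pretrained("capitalizer-onnx-quant")
inputs = tokenizer("10 downing st london", return_tensors="pt")

model.generate(**inputs, max_new_tokens=64)        # warm-up call, not measured
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=64)        # the call being timed
print(f"generate: {(time.perf_counter() - start) * 1000:.1f} ms")
```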

Unoptimized:
ByT5: 375ms
mT5: 675ms
BART: 1020ms

Optimized (level 1) + Quantized:
ByT5: 120ms
ByT5 w/fastT5: 100ms
BART: 330ms

Questions

  • Performance - Are these results to be expected for CPU inference on these models? To be production-worthy I would need to be in the 1-5 ms range, so I’m quite a way off right now. I currently use an old NLP library on my servers that does far more and manages sub-millisecond latency on CPU, so I feel it should be doable in some form. Are there any other big steps I can take to get the model to that point?

  • Model Choice - Are there any better models that would be more accurate or faster for my use case, and is “ForConditionalGeneration” the best/only method?

Any advice is greatly appreciated!