ByT5 and LongT5

A tokenizer-free, byte-level model like ByT5 apparently has a lot of advantages on noisy data and on tasks related to pronunciation. But its disadvantage is that effective sequence lengths are much more limited, since they're measured in bytes rather than tokens.
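
Just to make that trade-off concrete, here's a quick comparison I'd run with the Hugging Face tokenizers for the public t5-small and google/byt5-small checkpoints (assuming the transformers library is installed); the same line of text costs several times more positions at the byte level:

```python
# Compare how long the same text is in subword tokens vs. raw bytes.
from transformers import AutoTokenizer

subword_tok = AutoTokenizer.from_pretrained("t5-small")        # SentencePiece subwords
byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")  # UTF-8 bytes

line = "Shall I compare thee to a summer's day? Thou art more lovely and more temperate."

subword_len = len(subword_tok(line).input_ids)
byte_len = len(byte_tok(line).input_ids)

print(f"subword ids: {subword_len}")  # a couple of dozen for a line like this
print(f"byte ids:    {byte_len}")     # roughly one id per character, several times more
```

So a fixed 512- or 1024-position budget covers far less actual text once it's counted in bytes.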

To my hobbyist eye, it seems that the sequence length disadvantage could be somewhat mitigated by using an efficient attention mechanism like those in LongT5. But as far as I’ve been able to google, the two don’t seem to have been combined, even though they’re both built from the same base architecture.
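
For what it's worth, on paper the wiring looks straightforward: take LongT5's architecture and give it ByT5's byte vocabulary. Here's a rough, untested sketch using the transformers LongT5Config / LongT5ForConditionalGeneration classes (the dimensions and attention settings below are just placeholders I picked, and the model starts from random weights, so it would need its own pretraining run):

```python
# Hypothetical sketch: a LongT5 encoder-decoder over ByT5's byte vocabulary.
# I don't know of any pretrained checkpoint like this, so the weights here
# are randomly initialized.
from transformers import AutoTokenizer, LongT5Config, LongT5ForConditionalGeneration

byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")

config = LongT5Config(
    vocab_size=384,                             # ByT5's vocab: 256 bytes + specials + sentinels
    d_model=512, d_ff=1024,                     # arbitrary hobby-scale dimensions
    num_layers=8, num_decoder_layers=4, num_heads=6,
    encoder_attention_type="transient-global",  # LongT5's efficient encoder attention
    local_radius=127, global_block_size=16,
    decoder_start_token_id=0,                   # T5-style: decoder starts from the pad token
)
model = LongT5ForConditionalGeneration(config)

# The encoder uses local + transient-global attention instead of full quadratic
# attention, so byte-level inputs well past ByT5's usual budget are feasible.
lyrics = "la " * 2000                           # stand-in for a poem/lyric over 1024 bytes
inputs = byte_tok(lyrics, return_tensors="pt", truncation=True, max_length=4096)
targets = byte_tok("a short target sequence", return_tensors="pt")

out = model(input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=targets.input_ids)
print(out.logits.shape)                         # (1, target_len, 384)
```

The catch, of course, is that this only gives you the architecture; all the value of ByT5 and LongT5 is in their pretrained weights, and a byte-level LongT5 would have to be pretrained from scratch.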

Is there a reason the two techniques haven't been used together? Is the use case too niche? It seems like it would be great for a project I'm messing around with that uses poetry/lyrics, most of which are longer than 1024 bytes.

@jncasey: I was just wondering the same thing and stumbled across your question.

Did you find any answers or results back then?

I didn’t investigate any further (work/life got in the way), but I’ve been meaning to pick this project up again relatively soon. If you find anything that’s happened in the last year or so, please let me know!