ByT5 and LongT5

A tokenizer-free, byte-level model like ByT5 apparently has a lot of advantages on noisy data and on tasks related to pronunciation. But its disadvantage is that effective sequence lengths are much more limited, since they're measured in bytes rather than tokens.
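
Just to make that trade-off concrete, here's a quick comparison I'd run with the Hugging Face tokenizers for the public t5-small and google/byt5-small checkpoints (assuming the transformers library is installed); the same line of text costs several times more positions at the byte level:

```python
# Compare how long the same text is in subword tokens vs. raw bytes.
from transformers import AutoTokenizer

subword_tok = AutoTokenizer.from_pretrained("t5-small")        # SentencePiece subwords
byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")  # UTF-8 bytes

line = "Shall I compare thee to a summer's day? Thou art more lovely and more temperate."

subword_len = len(subword_tok(line).input_ids)
byte_len = len(byte_tok(line).input_ids)

print(f"subword ids: {subword_len}")  # a couple of dozen for a line like this
print(f"byte ids:    {byte_len}")     # roughly one id per character, several times more
```

So a fixed 512- or 1024-position budget covers far less actual text once it's counted in bytes.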

To my hobbyist eye, it seems that the sequence length disadvantage could be somewhat mitigated by using an efficient attention mechanism like those in LongT5. But as far as I’ve been able to google, the two don’t seem to have been combined, even though they’re both built from the same base architecture.
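
For what it's worth, on paper the wiring looks straightforward: take LongT5's architecture and give it ByT5's byte vocabulary. Here's a rough, untested sketch using the transformers LongT5Config / LongT5ForConditionalGeneration classes (the dimensions and attention settings below are just placeholders I picked, and the model starts from random weights, so it would need its own pretraining run):

```python
# Hypothetical sketch: a LongT5 encoder-decoder over ByT5's byte vocabulary.
# I don't know of any pretrained checkpoint like this, so the weights here
# are randomly initialized.
from transformers import AutoTokenizer, LongT5Config, LongT5ForConditionalGeneration

byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")

config = LongT5Config(
    vocab_size=384,                             # ByT5's vocab: 256 bytes + specials + sentinels
    d_model=512, d_ff=1024,                     # arbitrary hobby-scale dimensions
    num_layers=8, num_decoder_layers=4, num_heads=6,
    encoder_attention_type="transient-global",  # LongT5's efficient encoder attention
    local_radius=127, global_block_size=16,
    decoder_start_token_id=0,                   # T5-style: decoder starts from the pad token
)
model = LongT5ForConditionalGeneration(config)

# The encoder uses local + transient-global attention instead of full quadratic
# attention, so byte-level inputs well past ByT5's usual budget are feasible.
lyrics = "la " * 2000                           # stand-in for a poem/lyric over 1024 bytes
inputs = byte_tok(lyrics, return_tensors="pt", truncation=True, max_length=4096)
targets = byte_tok("a short target sequence", return_tensors="pt")

out = model(input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=targets.input_ids)
print(out.logits.shape)                         # (1, target_len, 384)
```

The catch, of course, is that this only gives you the architecture; all the value of ByT5 and LongT5 is in their pretrained weights, and a byte-level LongT5 would have to be pretrained from scratch.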

Is there a reason the two techniques haven't been used together? Is the use case too niche? It seems like it would be great for a project I'm messing around with that uses poetry/lyrics, most of which are longer than 1024 bytes.

@jncasey: I was just wondering the same thing and stumbled across your question.

Did you find any answers or results back then?

I didn’t investigate any further (work/life got in the way), but I’ve been meaning to pick this project up again relatively soon. If you find anything that’s happened in the last year or so, please let me know!