I am looking for a tokenizer-free language model. However, I can only find results for ByT5, which is pretrained with a variant of masked language modeling (MLM). Is there any model trained as a causal language model directly on UTF-8 bytes?
Reference work: ByT5: Towards a token-free future with pre-trained byte-to-byte models
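To clarify what "directly on UTF-8 bytes" means here: instead of a learned subword vocabulary, the model's token IDs would simply be the raw byte values (0–255) of the UTF-8 encoding, predicted left to right. A minimal sketch (the variable names are illustrative, not from any particular library):

```python
# Sketch: raw UTF-8 bytes as the token stream for a causal LM.
# The vocabulary is just the 256 possible byte values; no tokenizer needed.
text = "héllo"
token_ids = list(text.encode("utf-8"))  # each byte becomes one token id
# Note: "é" encodes to two bytes, so the sequence is longer than
# the character count — a known cost of byte-level modeling.
print(token_ids)  # [104, 195, 169, 108, 108, 111]
assert all(0 <= t < 256 for t in token_ids)
# Decoding is lossless: the byte sequence maps back to the original text.
decoded = bytes(token_ids).decode("utf-8")
assert decoded == text
```

A causal LM over this stream would model p(byte_t | byte_1..byte_{t-1}) with a 256-entry output softmax, whereas ByT5's MLM-style objective reconstructs masked spans.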