I recently skimmed the ToMe paper released by Meta Research. I don't personally have experience with vision transformers, and am wondering whether ToMe has implications for or uses in text-based transformers, or whether the underlying intuition behind token merging only applies to the spatial/audio/video modalities. I was looking through the GitHub repo earlier, and I definitely have more papers to read to understand ToMe in its entirety, but I was wondering if anyone here could give me a quick answer on this.
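
To make the question concrete, here is my rough understanding of ToMe's core step, bipartite soft matching: nothing in it looks image-specific, which is why I suspect it might transfer to text. This is just a minimal sketch of my reading of the paper, not the `facebookresearch/ToMe` API; I'm using the token features themselves for similarity (the paper uses attention keys) and a plain mean (the paper uses a size-weighted mean), and the function name is my own:

```python
import torch


def bipartite_soft_matching_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar (A, B) token pairs in x of shape (batch, n, dim).

    Returns a tensor of shape (batch, n - r, dim).
    """
    b, n, d = x.shape
    # Alternately assign tokens to two sets, A (even indices) and B (odd).
    a, b_set = x[:, ::2, :], x[:, 1::2, :]

    # Cosine similarity of every A token with every B token.
    a_n = torch.nn.functional.normalize(a, dim=-1)
    b_n = torch.nn.functional.normalize(b_set, dim=-1)
    scores = a_n @ b_n.transpose(-1, -2)               # (batch, |A|, |B|)

    # Each A token proposes its single best match in B.
    best_val, best_idx = scores.max(dim=-1)            # (batch, |A|)

    # Keep the r strongest edges; those A tokens get merged into B.
    order = best_val.argsort(dim=-1, descending=True)
    merged_src = order[:, :r]                          # A tokens to merge away
    kept_src = order[:, r:]                            # A tokens to keep

    # Gather the tokens being merged and their B destinations.
    src_tokens = torch.gather(a, 1, merged_src.unsqueeze(-1).expand(-1, -1, d))
    dst_idx = torch.gather(best_idx, 1, merged_src)    # (batch, r)

    # Average each merged A token into its matched B token
    # (plain mean here; the paper tracks token "size" for a weighted mean).
    b_out = b_set.clone()
    b_out.scatter_reduce_(1, dst_idx.unsqueeze(-1).expand(-1, -1, d),
                          src_tokens, reduce="mean", include_self=True)

    kept_a = torch.gather(a, 1, kept_src.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([kept_a, b_out], dim=1)           # (batch, n - r, dim)


# e.g. 10 token embeddings per sequence, merge 2 pairs -> 8 tokens
x = torch.randn(2, 10, 16)
print(bipartite_soft_matching_merge(x, r=2).shape)     # torch.Size([2, 8, 16])
```

Since this only consumes a `(batch, n_tokens, dim)` tensor, it seems like it should apply to text token embeddings just as well. The one wrinkle I can see is that the merge reorders tokens (unmerged A tokens first, then B), which seems harmless inside a ViT block after positional embeddings are added, but would presumably need care with causal attention or anything position-sensitive in a text model.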