ToMe for Text Transformers

I recently skimmed the ToMe paper released by Meta Research. I don't personally have experience with vision transformers, and I'm wondering whether ToMe has implications for or uses in text-based transformers, or whether the underlying intuition behind token merging only applies to the spatial/audio/video modalities. I was looking through the GitHub repo earlier, and I definitely have more papers to read to understand ToMe in its entirety, but I was wondering if anyone here could give me a quick answer on this.
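For context on why I suspect it might transfer: as I understand it, the merging step itself isn't obviously tied to a modality. ToMe's bipartite soft matching just splits the tokens into two sets, pairs each token with its most similar counterpart, and averages the r best pairs together, and a 1-D text sequence is still just a set of token embeddings. Here's a minimal NumPy sketch of that idea as I understand it (my own simplification; the actual paper matches on attention keys and uses proportional attention to track merged-token sizes, both of which I've omitted):

```python
import numpy as np

def bipartite_soft_matching(x, r):
    """Merge the r most similar token pairs (simplified ToMe-style sketch).

    x: (n, d) array of token embeddings -- text tokens or image patches alike.
    Returns a (n - r, d) array of tokens after merging.
    """
    # Normalize so dot products are cosine similarities.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    a, b = xn[0::2], xn[1::2]              # alternating bipartite split
    scores = a @ b.T                        # (|A|, |B|) similarity matrix
    best_b = scores.argmax(axis=1)          # each A token's best match in B
    best_score = scores.max(axis=1)

    merged_a = np.argsort(-best_score)[:r]  # the r most similar A tokens
    kept_a = np.setdiff1d(np.arange(a.shape[0]), merged_a)

    dst = x[1::2].copy()                    # B tokens are merge destinations
    counts = np.ones(dst.shape[0])
    for i in merged_a:                      # average each merged A token
        j = best_b[i]                       # into its matched B token
        dst[j] += x[0::2][i]
        counts[j] += 1
    dst /= counts[:, None]

    return np.concatenate([x[0::2][kept_a], dst], axis=0)

tokens = np.random.randn(16, 8)             # e.g. 16 text-token embeddings
out = bipartite_soft_matching(tokens, r=4)
print(out.shape)                            # (12, 8): 4 tokens merged away
```

Nothing here references patch positions, which is what makes me think the obstacle for text (if there is one) would be something else, e.g. causal attention or positional handling, rather than the matching itself.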