Split compound words (windfall = wind + fall)

Is there a way to use any of the models to split words into word parts that might be useful downstream? For example, consider a custom domain where words like windfall and firewall exist, but a user may search for “wind fall” or “fire wall” downstream. The most basic way I thought of was to split a word at every possible position and “accept” a split whose sub-parts make sense. For example, windfall = w + indfall, wi + ndfall, win + dfall and so on… Then apply an existing language model to see if the sub-part words exist in its vocabulary.
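To make the idea concrete, here is a minimal sketch of that brute-force split. The function name `two_part_splits`, the `min_len` cutoff, and the toy vocabulary are all made up for illustration; in practice the vocabulary would come from whatever model or word list is available.

```python
def two_part_splits(word, vocab, min_len=2):
    """Return every (left, right) split of `word` where both halves are in `vocab`.

    `min_len` is an assumed cutoff that skips very short fragments such as "w".
    """
    word = word.lower()
    splits = []
    # Try every cut position that leaves at least `min_len` characters per side.
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in vocab and right in vocab:
            splits.append((left, right))
    return splits


# Toy vocabulary, hypothetical; replace with a real vocabulary or word list.
toy_vocab = {"wind", "fall", "fire", "wall", "win", "all"}

print(two_part_splits("windfall", toy_vocab))
print(two_part_splits("firewall", toy_vocab))
```

A word with no valid split simply returns an empty list, so it can be left unchanged downstream.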

I'd appreciate it if anyone has pointers.

Hi @mangled,

I believe subword tokenizers (e.g. the PreTrainedTokenizer classes from HuggingFace, some of which are based on SentencePiece) already do this sort of word splitting.

But most models rely on their own pretrained tokenizer with its own fixed vocab, so the subword units you get may not match the word parts you want (windfall will not necessarily come out as wind + fall). You could try training your own tokenizer on your domain text.

Hope this helps.

Thanks for the response. It turned out that doing this cheaply, by splitting words based on what's in the vocab, seems to work okay.
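For anyone landing here later, one possible shape of such a vocab-based split is sketched below. This is not necessarily what was actually used in this thread; it is an assumed variant that allows more than two parts and prefers the segmentation with the fewest parts. The names `split_compound`, `min_len`, and `domain_vocab` are hypothetical.

```python
def split_compound(word, vocab, min_len=3):
    """Split `word` into the fewest parts that are all in `vocab`.

    Returns a list of parts, or None if no full segmentation exists.
    `min_len` is an assumed minimum part length to avoid tiny fragments.
    """
    word = word.lower()
    n = len(word)
    # best[i] is the shortest list of vocab parts covering word[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(0, end - min_len + 1):
            part = word[start:end]
            if best[start] is not None and part in vocab:
                candidate = best[start] + [part]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy vocabulary, hypothetical; a real one might be a tokenizer's vocab keys.
domain_vocab = {"wind", "fall", "fire", "wall", "paper", "wallpaper"}

print(split_compound("windfall", domain_vocab))
print(split_compound("firewallpaper", domain_vocab))
```

Note that if the whole word is itself in the vocabulary, this sketch returns it as a single part, so the compound would need to be left out of the lookup set to force a split.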