Incorporating my tokenizer into Hugging Face

Hi there,

About a year ago, my lab released SaGe, a tokenizer that incorporates contextual signals from corpora and thus learns tokens that are better aligned with LM objectives. The paper is here:

Recently, we released a version that’s much faster than the original and that better streamlines the corpus for vocabulary training. The (Python) implementation is here:

We were wondering whether, and how, we could get support for porting SaGe into Hugging Face tokenizers and making it a first-class member of the codebase. Would any of your engineers be able to help? What would you need from us?

Thanks,

Yuval
uvp@cs.bgu.ac.il