Tokenizing using JS

I’ve exported a custom PyTorch-based Transformer model to ONNX so that I can run it in Node.js. However, the exported model expects input_ids directly rather than raw text.

Is there any way I can perform tokenization in JS?

Or am I missing something, and the ONNX model itself can perform the tokenization as well?

I have the same problem. It seems converting to ONNX is only half the battle. Maybe I’ll write a library. How hard could it be?

I have written a JavaScript library that can run the T5 tokenizer: tokenizers.js in the praeclarum/transformers-js repository on GitHub.

Awesome library @praeclarum! I used it as inspiration when developing https://github.com/xenova/transformers.js, which supports BERT, DistilBERT, T5 and GPT2 tokenization and inference.
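For anyone wondering what the tokenization step has to produce before the ONNX model can run, here is a minimal sketch of greedy WordPiece-style tokenization (the scheme BERT-like models use) in plain JavaScript. It is not the API of either library mentioned above: the vocabulary, the `wordPiece` helper, and the `encode` function are all hypothetical names invented for illustration, and a real tokenizer also handles punctuation, Unicode normalization, and the model's actual vocabulary file.

```javascript
// Toy vocabulary (hypothetical); a real model ships its own vocab file.
// The array index of each entry is its token id.
const VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "token", "##izer", "##s", "in", "js"];
const TOKEN_TO_ID = new Map(VOCAB.map((tok, id) => [tok, id]));

// Split one lowercase word into the longest matching vocabulary pieces.
// Pieces that continue a word are looked up with a "##" prefix.
function wordPiece(word) {
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let match = null;
    // Shrink the candidate from the right until it is in the vocabulary.
    while (end > start) {
      const candidate = (start > 0 ? "##" : "") + word.slice(start, end);
      if (TOKEN_TO_ID.has(candidate)) {
        match = candidate;
        break;
      }
      end -= 1;
    }
    // If no piece matches, the whole word maps to the unknown token.
    if (match === null) return ["[UNK]"];
    pieces.push(match);
    start = end;
  }
  return pieces;
}

// Turn raw text into the input_ids array the exported model expects.
// Assumes whitespace-separated words and a lowercased vocabulary.
function encode(text) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const tokens = ["[CLS]", ...words.flatMap(wordPiece), "[SEP]"];
  return tokens.map((tok) => TOKEN_TO_ID.get(tok));
}

console.log(encode("Tokenizers in JS")); // [2, 4, 5, 6, 7, 8, 3]
```

The resulting array of ids is what gets wrapped in a tensor and fed to the model as input_ids. In practice, one of the libraries above would load the real vocabulary for you instead of a hand-written list like this one.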

Thank you so, so much. This library finally made it possible for even a noob like me to run my models in the browser. I’d probably have gotten a C in my Data Mining class if I hadn’t found this amazing tool.
