Convert huggingface tokenizer into sentencepiece format

RaphaelKalandadze · May 7, 2024, 5:31pm

I have a Hugging Face tokenizer for the BERT model (google-bert/bert-base-cased), which consists of three files: tokenizer.json, tokenizer_config.json, and vocab.txt. I would like to convert it to the SentencePiece format, which uses a single .model file.
How can I perform this conversion?

bh4 · November 27, 2024, 2:06pm

Similar problem here. I would like to convert the smollm2-360m Hugging Face tokenizer to SentencePiece format but couldn’t find any way of doing so. Can anyone guide me?
