I have a Hugging Face tokenizer with three files: tokenizer.json, tokenizer_config.json, and vocab.txt. However, according to the documentation, the Marian tokenizer requires files in the SentencePiece format (.model and .vocab files). Is there a way to construct a Marian tokenizer without these specific file formats?
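For reference, this is roughly my current situation (the `my_tokenizer/` directory name is just a placeholder for wherever the three files live): the existing files load fine as a generic fast tokenizer, but that is not the format `MarianTokenizer` asks for.

```python
# Minimal sketch of the current situation; "my_tokenizer/" is a placeholder
# directory containing tokenizer.json, tokenizer_config.json and vocab.txt.
from transformers import PreTrainedTokenizerFast

# The existing files load without problems as a generic fast tokenizer.
tok = PreTrainedTokenizerFast(tokenizer_file="my_tokenizer/tokenizer.json")
print(tok("hello world"))

# MarianTokenizer, by contrast, expects SentencePiece files
# (e.g. source.spm / target.spm plus a vocab file), which I do not have.
```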
I have already tried converting the tokenizer into a .model file and then constructing the Marian tokenizer from it, but that approach raised new issues of its own. Are there any alternative approaches or workarounds for using the existing tokenizer files with the Marian tokenizer?
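In case it helps to see where things go wrong, this is roughly the shape of my conversion attempt; the corpus path, model prefix, and vocab file below are illustrative placeholders rather than my exact setup.

```python
# Rough sketch of the conversion attempt (paths are placeholders): produce a
# SentencePiece .model/.vocab pair, then point MarianTokenizer at it.
import sentencepiece as spm
from transformers import MarianTokenizer

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder training text
    model_prefix="converted",  # writes converted.model and converted.vocab
    vocab_size=32000,
)

tok = MarianTokenizer(
    source_spm="converted.model",
    target_spm="converted.model",
    vocab="vocab.json",        # MarianTokenizer also expects a JSON vocab file
)
```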