Topic | Replies | Views | Activity
Convert huggingface tokenizer into sentencepiece format | 0 | 10 | May 7, 2024
Construct a Marian tokenizer. Based on huggingface tokenizers | 0 | 11 | May 7, 2024
Can't load tokenizer using from_pretrained, Inference API | 4 | 80 | May 6, 2024
A question about the DataCollator for LM | 2 | 29 | May 6, 2024
Asking to pad but the tokenizer does not have a padding token | 0 | 27 | May 6, 2024
Which file stores token frequency in SentencePieceBPETokenizer? | 0 | 32 | May 3, 2024
Documentation of SentencePieceBPETokenizer? | 0 | 31 | May 2, 2024
Speed up tokenizer training | 1 | 183 | April 22, 2024
Converting TikToken to Huggingface Tokenizer | 1 | 1994 | April 22, 2024
Tokenizer mapping the same token to multiple token_ids | 4 | 81 | April 22, 2024
Treat Hawaiian Glottal stop as consonant, not punctuation | 0 | 79 | April 19, 2024
Train tokenizer for seq2seq model | 0 | 66 | April 19, 2024
ViTImageProcessor output visualization | 8 | 266 | April 18, 2024
How to train a LlamaTokenizer? | 15 | 1536 | April 18, 2024
Escape symbol appearance | 0 | 59 | April 16, 2024
Loading BPE modeled Tokenizer results in empty tokenizer | 0 | 95 | April 15, 2024
Translate from one tokenizer to another | 0 | 77 | April 15, 2024
Custom training - tokenization via collate fn or __getitem__? | 0 | 87 | April 14, 2024
Running train_new_from_iterator to train a tokenizer is very slow | 1 | 126 | April 13, 2024
Printing tokens array | 0 | 54 | April 12, 2024
Preprocessing of dataset | 0 | 86 | April 10, 2024
Error with new tokenizers (URGENT!) | 12 | 38448 | April 4, 2024
Error loading tokenizer from local checkpoint directory | 2 | 1065 | April 4, 2024
Is it safe to assume tokenizer does not change after initialization? | 0 | 109 | March 30, 2024
WordPiece tokenizer doesn't work for long sequences | 1 | 337 | March 28, 2024
Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).` | 9 | 12956 | March 26, 2024
Adding tokens, but tokenizer doesn't use them | 0 | 117 | March 25, 2024
OPT special tokens | 0 | 82 | March 25, 2024
Inputs.word_ids() length not matching word label length | 3 | 377 | March 22, 2024
Cannot load tokenizer for llama2 | 4 | 1889 | March 20, 2024