🤗Tokenizers

Topic	Replies	Views	Activity
ONNX T5 - Decoding seq2seq tokens	1	507	May 8, 2024
Construct a Marian tokenizer. Based on huggingface tokenizers	0	216	May 7, 2024
Can't load tokenizer using from_pretrained, Inference API	4	1829	May 6, 2024
A question about the DataCollator for LM	2	416	May 6, 2024
Asking to pad but the tokenizer does not have a padding token	0	1779	May 6, 2024
Which file stores token frequency in SentencePieceBPETokenizer?	0	183	May 3, 2024
Documentation of SentencePieceBPETokenizer?	0	914	May 2, 2024
Converting TikToken to Huggingface Tokenizer	1	2595	April 22, 2024
Tokenizer mapping the same token to multiple token_ids	4	760	April 22, 2024
Treat Hawaiian Glottal stop as consonant, not punctuation	0	170	April 19, 2024
Train tokenizer for seq2seq model	0	359	April 19, 2024
ViTImageProcessor output visualization	8	733	April 18, 2024
Escape symbol appearance	0	136	April 16, 2024
Loading BPE modeled Tokenizer results in empty tokenizer	0	344	April 15, 2024
Translate from one tokenizer to another	0	172	April 15, 2024
Custom training - tokenization via collate fn or __getitem__?	0	416	April 14, 2024
Running train_new_from_iterator to train a tokenizer is very slow	1	439	April 13, 2024
Printing tokens array	0	134	April 12, 2024
Preprocessing of dataset	0	177	April 10, 2024
Is it safe to assume tokenizer does not change after initialization?	0	180	March 30, 2024
WordPiece tokenizer doesn't work for long sequences	1	401	March 28, 2024
OPT special tokens	0	162	March 25, 2024
Inputs.word_ids() length not matching word label length	3	556	March 22, 2024
How does `byte_fallback` work and affect vocab size in BPE?	1	2056	March 19, 2024
Custom Tokenizing?	0	245	March 19, 2024
Reused tokenizer returns unk	1	528	March 14, 2024
Adding too many tokens breaks tokenizer	0	315	March 12, 2024
Fastest way to tokenize millions of examples?	4	2940	March 8, 2024
Run Mistral model only on CPU	0	1675	March 6, 2024
Tokenizer not recognising words in vocabulary	4	1924	March 5, 2024