🤗Tokenizers

Topic	Replies	Views	Activity
T5Tokenizer add a whitespace token after added special tokens	0	336	November 22, 2023
Unable to register my own tokenizer	0	182	November 21, 2023
Problem with doubled tokens in NLLB Tokenizer after loading new vocab	0	271	November 21, 2023
Use Unicode blocks in regex (in Replace normalizer)	1	1078	November 9, 2023
How to handle translations one source language to many target sentences for the same language	0	180	November 9, 2023
NER Label tokenization with overflowing tokens	4	1427	November 6, 2023
How to use a trained tokenizer for semantic search?	0	362	November 5, 2023
Unk_token not set after training a BPETokenizer tokenizer	1	603	November 1, 2023
AutoTokenizer is very slow when loading llama tokenizer	2	1830	October 31, 2023
Offset mappings differ for tokenizers	0	1664	October 30, 2023
The process for tokenizing concatenated dataset is slow at the end of tokenizing	0	167	October 30, 2023
Batch tokenize (split into tokens, without processing)	4	738	October 28, 2023
Does AutoTokenizer upload data to HuggingFace?	0	200	October 25, 2023
RobertaTokenizer decode and tokenize do not have the same output	0	247	October 24, 2023
Training tokenizers with padding in between tokens	0	377	October 19, 2023
Leaving unknown words untokenized like in OpenNMT	0	254	October 18, 2023
Keeping special chars in translations	0	302	October 12, 2023
Should cls_token be [CLS] or <cls>?	3	276	October 11, 2023
How to make the tokenizer add spaces correctly when decoding a sequence with add_prefix_space=False	0	565	October 9, 2023
I was using Hugging Face meta-llama/Llama-2-7b-chat-hf and I'm facing an error	2	2566	October 8, 2023
OSError: Can't load tokenizer for 'facebook/xmod-base'	1	1223	October 6, 2023
`GPT2Tokenizer` Tokenizer handling `\n\n` differently in different settings	4	788	October 4, 2023
DeBERTa - ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length	2	1482	October 3, 2023
Error training MLM with Roberta Tokenizer	1	1444	September 17, 2023
Are the slow and fast tokenizer results the same output for the same input?	0	563	August 30, 2023
"Add_tokens" breaks words when encoding	2	1273	August 22, 2023
2 tokens for one character in T5	2	1617	August 10, 2023
OSError: Model name 'gpt2' was not found in tokenizers model name list (gpt2,...)	8	7392	August 10, 2023
DNA long sequence tokenization	2	2756	August 6, 2023
SentencePiece tokenizer encodes to unknown token	0	879	August 2, 2023