🤗Tokenizers

Topic	Replies	Views	Activity
T5v1.1 tokenizer legacy=False	0	688	February 22, 2024
How to deal with SQL queries in a tabular dataset?	0	194	February 17, 2024
Issue with German umlauts in Python in deepseek-ai/deepseek-coder-1.3b-instruct	0	224	February 16, 2024
Incorporating my tokenizer into huggingface	0	249	February 15, 2024
Tokenizer splits up pre-split tokens	9	6707	February 9, 2024
Building a custom Java tokenizer	0	634	February 4, 2024
Adding New Tokens to MarianMT Model	8	771	February 4, 2024
Issue with KOSMOS-2 encoding and decoding	11	472	January 26, 2024
Adding new tokens while preserving tokenization of adjacent tokens	4	18952	January 25, 2024
Is there a way to save a pre-compiled AutoTokenizer?	1	355	January 25, 2024
Issues with BPE tokenizer	2	279	January 24, 2024
FastTokenizer adds 10 more tokens on average	0	201	January 20, 2024
Added Tokens Not Decoding with Spaces	3	2875	January 19, 2024
SOLVED: Module 'numpy' has no attribute 'object'. `np.object` was a deprecated alias for the builtin `object`. for train_dataset.map(tokenize, batched=True) in notebook	1	9617	January 18, 2024
Special_tokens_mask	0	176	January 15, 2024
Unmasking adds an extra whitespace for BPE tokenizer	0	275	January 14, 2024
Caching tokenization	0	247	January 14, 2024
Right choice of padding side for Mistral	0	2122	January 8, 2024
Regular tokens vs special tokens	5	3686	January 8, 2024
Issue in loading the saved tokenizer	1	242	January 4, 2024
SentencePiece user_defined_symbols and fast tokenizers	1	1611	January 3, 2024
Many ambiguous unicode characters for trained tokenizer	0	388	December 31, 2023
Skew between mistral prompt in docs vs. chat template	2	1138	December 27, 2023
Tokenizer shrinking recipes	7	2746	December 24, 2023
How to decode with custom pad tokens	3	4089	December 22, 2023
Training SentencePiece from scratch?	8	19623	December 19, 2023
Questions re: Tokenizer pipeline composability / reuse outside of the HF ecosystem	0	216	December 18, 2023
Get intermediate tokens and merges used in tokenization	0	480	December 1, 2023
BertTokenizer.decode not understanding new vocabulary	0	350	December 1, 2023
Tokenizer tends to choose added tokens first rather than tokens in vocab	1	551	November 30, 2023