Building a custom Java tokenizer | 0 | 601 | February 4, 2024
Adding New Tokens to MarianMT Model | 8 | 741 | February 4, 2024
Issue with KOSMOS-2 encoding and decoding | 11 | 468 | January 26, 2024
Adding new tokens while preserving tokenization of adjacent tokens | 4 | 18468 | January 25, 2024
Is there a way to save a pre-compiled AutoTokenizer? | 1 | 345 | January 25, 2024
Issues with BPE tokenizer | 2 | 267 | January 24, 2024
FastTokenizer add 10 more tokens in Avg | 0 | 197 | January 20, 2024
Added Tokens Not Decoding with Spaces | 3 | 2806 | January 19, 2024
SOLVED: Module 'numpy' has no attribute 'object'. `np.object` was a deprecated alias for the builtin `object`. for train_dataset.map(tokenize, batched=True) in notebook | 1 | 8828 | January 18, 2024
Special_tokens_mask | 0 | 173 | January 15, 2024
Unmasking adds an extra whitespace for BPE tokenizer | 0 | 269 | January 14, 2024
Caching tokenization | 0 | 232 | January 14, 2024
Right choice of padding side for Mistral | 0 | 2057 | January 8, 2024
Regular tokens vs special tokens | 5 | 3318 | January 8, 2024
Issue in loading the saved tokenizer | 1 | 235 | January 4, 2024
SentencePiece user_defined_symbols and fast tokenizers | 1 | 1519 | January 3, 2024
Many ambiguous unicode characters for trained tokenizer | 0 | 361 | December 31, 2023
Skew between mistral prompt in docs vs. chat template | 2 | 1119 | December 27, 2023
Tokenizer shrinking recipes | 7 | 2535 | December 24, 2023
How to decode with custom pad tokens | 3 | 4074 | December 22, 2023
Training sentencePiece from scratch? | 8 | 18819 | December 19, 2023
Questions re: Tokenizer pipeline composability / reuse outside of the HF ecosystem | 0 | 213 | December 18, 2023
Get intermediate tokens and merges used in tokenization | 0 | 457 | December 1, 2023
BertTokenizer.decode not understanding new vocabulary | 0 | 344 | December 1, 2023
Tokenizer tend to choose added tokens first rather than token in vocab | 1 | 541 | November 30, 2023
Special token printed out as output | 6 | 1011 | November 24, 2023
[NER][Japanese] labeled segment shorter than token | 0 | 215 | November 23, 2023
T5Tokenizer add a whitespace token after added special tokens | 0 | 327 | November 22, 2023
Unable to register my own tokenizer | 0 | 181 | November 21, 2023
Get Problem with Doubled tokens in NLLB Tokenizer After load new vocab! | 0 | 271 | November 21, 2023