| Tokenizer dataset is very slow | 3 | 2808 | March 2, 2024 |
| T5 tokenizer vs T5 v1.1 tokenizer | 0 | 125 | March 1, 2024 |
| Generate tokenizer.json for Marian(Opus) MT | 1 | 542 | February 26, 2024 |
| Phi model produces ids beyond the tokenizer's vocab size, so Phi-2 tokenizer.batch_decode() raises an error: expected string got NoneType | 0 | 263 | February 24, 2024 |
| I/O error calling TokenizersLibrary.createTokenizer in container | 1 | 163 | February 23, 2024 |
| T5v1.1 tokenizer legacy=False | 0 | 198 | February 22, 2024 |
| How to deal with SQL queries in a tabular dataset? | 0 | 149 | February 17, 2024 |
| Issue with German umlauts (Python) in deepseek-ai/deepseek-coder-1.3b-instruct | 0 | 141 | February 16, 2024 |
| Incorporating my tokenizer into Hugging Face | 0 | 124 | February 15, 2024 |
| Tokenizer splits up pre-split tokens | 9 | 5005 | February 9, 2024 |
| Building a custom Java tokenizer | 0 | 270 | February 4, 2024 |
| Adding New Tokens to MarianMT Model | 8 | 259 | February 4, 2024 |
| Get "using the `__call__` method is faster" warning with DataCollatorWithPadding | 7 | 13544 | February 1, 2024 |
| Issue with KOSMOS-2 encoding and decoding | 11 | 295 | January 26, 2024 |
| Adding new tokens while preserving tokenization of adjacent tokens | 4 | 14469 | January 25, 2024 |
| Is there a way to save a pre-compiled AutoTokenizer? | 1 | 182 | January 25, 2024 |
| Issues with BPE tokenizer | 2 | 180 | January 24, 2024 |
| FastTokenizer adds 10 more tokens on average | 0 | 144 | January 20, 2024 |
| Added Tokens Not Decoding with Spaces | 3 | 2116 | January 19, 2024 |
| SOLVED: Module 'numpy' has no attribute 'object'. `np.object` was a deprecated alias for the builtin `object`. for train_dataset.map(tokenize, batched=True) in notebook | 1 | 2919 | January 18, 2024 |
| Special_tokens_mask | 0 | 106 | January 15, 2024 |
| Unmasking adds an extra whitespace for BPE tokenizer | 0 | 178 | January 14, 2024 |
| Caching tokenization | 0 | 127 | January 14, 2024 |
| Right choice of padding side for Mistral | 0 | 988 | January 8, 2024 |
| Regular tokens vs special tokens | 5 | 1825 | January 8, 2024 |
| Issue in loading the saved tokenizer | 1 | 167 | January 4, 2024 |
| SentencePiece user_defined_symbols and fast tokenizers | 1 | 985 | January 3, 2024 |
| Many ambiguous unicode characters for trained tokenizer | 0 | 196 | December 31, 2023 |
| Skew between Mistral prompt in docs vs. chat template | 2 | 822 | December 27, 2023 |
| Tokenizer shrinking recipes | 7 | 1473 | December 24, 2023 |