WordPiece tokenizer doesn't work for long sequences
|
|
1
|
358
|
March 28, 2024
|
Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).`
|
|
9
|
13325
|
March 26, 2024
|
Adding tokens, but tokenizer doesn't use them
|
|
0
|
139
|
March 25, 2024
|
OPT special tokens
|
|
0
|
91
|
March 25, 2024
|
Inputs.word_ids() length not matching word label length
|
|
3
|
386
|
March 22, 2024
|
Cannot load tokenizer for llama2
|
|
4
|
2109
|
March 20, 2024
|
How does `byte_fallback` work and affect vocab size in BPE?
|
|
1
|
940
|
March 19, 2024
|
Custom Tokenizing?
|
|
0
|
135
|
March 19, 2024
|
Reused tokenizer returns unk
|
|
1
|
485
|
March 14, 2024
|
Adding too many tokens breaks tokenizer
|
|
0
|
167
|
March 12, 2024
|
Fastest way to tokenize millions of examples?
|
|
4
|
2077
|
March 8, 2024
|
Run Mistral model only on CPU
|
|
0
|
867
|
March 6, 2024
|
Tokenizer not recognising words in vocabulary
|
|
4
|
1425
|
March 5, 2024
|
Tokenizer dataset is very slow
|
|
3
|
2934
|
March 2, 2024
|
T5 tokenizer vs T5 v1.1 tokenizer
|
|
0
|
142
|
March 1, 2024
|
Generate tokenizer.json for Marian(Opus) MT
|
|
1
|
561
|
February 26, 2024
|
Phi model produces ids beyond the tokenizer's vocab size, so Phi-2 tokenizer.batch_decode() fails with error: expected string got NoneType
|
|
0
|
292
|
February 24, 2024
|
I/O error calling TokenizersLibrary.createTokenizer in container
|
|
1
|
190
|
February 23, 2024
|
T5v1.1 tokenizer legacy=False
|
|
0
|
247
|
February 22, 2024
|
How to deal with SQL queries in a tabular dataset?
|
|
0
|
157
|
February 17, 2024
|
Issue with German umlauts python in deepseek-ai/deepseek-coder-1.3b-instruct
|
|
0
|
151
|
February 16, 2024
|
Incorporating my tokenizer into huggingface
|
|
0
|
143
|
February 15, 2024
|
Tokenizer splits up pre-split tokens
|
|
9
|
5180
|
February 9, 2024
|
Building a custom Java tokenizer
|
|
0
|
302
|
February 4, 2024
|
Adding New Tokens to MarianMT Model
|
|
8
|
294
|
February 4, 2024
|
Get "using the `__call__` method is faster" warning with DataCollatorWithPadding
|
|
7
|
13842
|
February 1, 2024
|
Issue with KOSMOS-2 encoding and decoding
|
|
11
|
329
|
January 26, 2024
|
Adding new tokens while preserving tokenization of adjacent tokens
|
|
4
|
14847
|
January 25, 2024
|
Is there a way to save a pre-compiled AutoTokenizer?
|
|
1
|
202
|
January 25, 2024
|
Issues with BPE tokenizer
|
|
2
|
202
|
January 24, 2024
|