Topic | Replies | Views | Activity
Speed up tokenizer training | 1 | 150 | April 22, 2024
Converting TikToken to Huggingface Tokenizer | 1 | 1922 | April 22, 2024
Tokenizer mapping the same token to multiple token_ids | 4 | 61 | April 22, 2024
Treat Hawaiian Glottal stop as consonant, not punctuation | 0 | 60 | April 19, 2024
Train tokenizer for seq2seq model | 0 | 42 | April 19, 2024
ViTImageProcessor output visualization | 8 | 232 | April 18, 2024
How to train a LlamaTokenizer? | 15 | 1410 | April 18, 2024
Escape symbol appearance | 0 | 43 | April 16, 2024
Loading BPE modeled Tokenizer results in empty tokenizer | 0 | 65 | April 15, 2024
Translate from one tokenizer to another | 0 | 61 | April 15, 2024
Custom training - tokenization via collate fn or __getitem__? | 0 | 63 | April 14, 2024
Running train_new_from_iterator to train a tokenizer is very slow | 1 | 104 | April 13, 2024
Printing tokens array | 0 | 44 | April 12, 2024
Preprocessing of dataset | 0 | 68 | April 10, 2024
Error with new tokenizers (URGENT!) | 12 | 37455 | April 4, 2024
Error loading tokenizer from local checkpoint directory | 2 | 1034 | April 4, 2024
Is it safe to assume tokenizer does not change after initialization? | 0 | 90 | March 30, 2024
WordPiece tokenizer doesn't work for long sequences | 1 | 318 | March 28, 2024
Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).` | 9 | 12586 | March 26, 2024
Adding tokens, but tokenizer doesn't use them | 0 | 98 | March 25, 2024
OPT special tokens | 0 | 71 | March 25, 2024
Inputs.word_ids() length not matching word label length | 3 | 349 | March 22, 2024
Cannot load tokenizer for llama2 | 4 | 1698 | March 20, 2024
How does `byte_fallback` work and affect vocab size in BPE? | 1 | 805 | March 19, 2024
Custom Tokenizing? | 0 | 87 | March 19, 2024
Reused tokenizer returns unk | 1 | 464 | March 14, 2024
Adding too many tokens breaks tokenizer | 0 | 128 | March 12, 2024
Fastest way to tokenize millions of examples? | 4 | 1948 | March 8, 2024
Run Mistral model only on CPU | 0 | 705 | March 6, 2024
Tokenizer not recognising words in vocabulary | 4 | 1355 | March 5, 2024