Topic | Replies | Views | Activity
NLLB tokenizer multiple target/source languages within a training batch | 0 | 19 | September 25, 2023
Get "using the `__call__` method is faster" warning with DataCollatorWithPadding | 6 | 7717 | September 21, 2023
Decode token IDs into a list (not a single string) | 3 | 196 | September 18, 2023
Error training MLM with Roberta Tokenizer | 1 | 866 | September 17, 2023
Cannot load tokenizer for llama2 | 0 | 101 | September 14, 2023
Are the slow and fast tokenizer results the same output for the same input? | 0 | 100 | August 30, 2023
Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).` | 7 | 5997 | August 26, 2023
SentencePiece user_defined_symbols and fast tokenizers | 0 | 113 | August 25, 2023
"Add_tokens" breaks words when encoding | 2 | 479 | August 22, 2023
2 tokens for one character in T5 | 2 | 709 | August 10, 2023
OSError: Model name 'gpt2' was not found in tokenizers model name list (gpt2,...) | 8 | 4767 | August 10, 2023
DNA long sequence tokenization | 2 | 1272 | August 6, 2023
SentencePiece tokenizer encodes to unknown token | 0 | 160 | August 2, 2023
Tokenizer behaviour with pipeline | 0 | 100 | August 1, 2023
Load tokenizer from file : Exception: data did not match any variant of untagged enum ModelWrapper | 3 | 2453 | August 1, 2023
Converting TikToken to Huggingface Tokenizer | 0 | 1045 | March 10, 2023
ArrowInvalid: Column 3 named attention_mask expected length 1000 but got length 1076 | 3 | 2042 | July 26, 2023
Discussing the Pros and Cons of Using add_tokens vs. Byte Pair Encoding (BPE) for Adding New Tokens to an Existing RoBERTa Model | 0 | 134 | July 14, 2023
Initialize Vocabulary for Unigram Tokenizer | 0 | 107 | July 11, 2023
Make correct padding for text generation with GPT-NEO | 0 | 194 | July 5, 2023
DeBERTa - ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length | 0 | 218 | July 4, 2023
How does a tokenizer (e.g., AutoTokenizer) generate word_ids integers? | 0 | 131 | June 26, 2023
Seeking an end-to-end example of grouping, tokenization and padding to construct preprocessed data in HF | 0 | 134 | June 26, 2023
Writing custom tokenizer and wrapping it in tokenizer object | 2 | 239 | June 26, 2023
Error loading tokenizer from local checkpoint directory | 0 | 315 | June 25, 2023
AutoTokenizer is very slow when loading llama tokenizer | 1 | 620 | June 22, 2023
Tokenizer for German lang | 0 | 152 | June 22, 2023
Chunk tokens into desired chunk length without simply getting rid of rest of tokens | 0 | 135 | June 15, 2023
Error with new tokenizers (URGENT!) | 8 | 18137 | June 12, 2023
Padding not transferring when loading a tokenizer trained via the tokenizers library into transformers | 0 | 202 | June 12, 2023