| Topic | Replies | Views | Activity |
|---|---|---|---|
| How to ensure that tokenizers never truncate partial words? | 2 | 1788 | January 24, 2022 |
| How to ensure the `overflow` with `stride` always starts with a full word? | 0 | 1271 | January 24, 2022 |
| Adding new tokens to a BERT tokenizer - Getting ValueError | 2 | 1475 | January 16, 2022 |
| Adding token to t5-base vocab does not respect space | 0 | 726 | January 13, 2022 |
| How can I change the token id of a special token? | 0 | 883 | January 6, 2022 |
| Import distilbert-base-uncased tokenizer to an android app along with the tflite model | 3 | 1933 | December 29, 2021 |
| What are the equivalent manner for using texts_to_sequences? | 0 | 645 | December 29, 2021 |
| ERROR?why encoding [MASK] before '.' would gain a idx 13? | 5 | 1048 | December 27, 2021 |
| LongFormer tokenizer has the same token_type_ids for sequence pairs | 0 | 714 | December 20, 2021 |
| Batch encode plus in Rust Tokenizers | 1 | 745 | December 13, 2021 |
| Best solution for train tokenizer and MLM from scratch | 0 | 729 | December 6, 2021 |
| Implementing custom tokenizer components (normalizers, processors) | 1 | 2871 | November 30, 2021 |
| Does T5Tokenizer support the Greek language? | 1 | 838 | November 24, 2021 |
| How padding in huggingface tokenizer works? | 4 | 6750 | November 22, 2021 |
| Why we need to add special tokens to tasks other than classification? | 0 | 869 | November 17, 2021 |
| How to configure TokenizerFast for AutoTokenizer | 2 | 1858 | November 11, 2021 |
| How to employ different vocabs for encoder and decoder respectively? | 0 | 675 | November 9, 2021 |
| How to use tokenizer.tokenize in Chinese data properly? | 0 | 907 | November 9, 2021 |
| Mask only specific words | 4 | 3712 | November 7, 2021 |
| Load custom pretrained tokenizer | 0 | 1609 | October 28, 2021 |
| Using Custom Vocab.txt | 0 | 1241 | October 17, 2021 |
| Tokenizer.encode not returning encodings | 2 | 896 | October 9, 2021 |
| There is no 0.11.0 tokenizers in pip | 4 | 787 | September 30, 2021 |
| Performance difference between ByteLevelBPE and Wordpiece tokenizers | 0 | 685 | September 22, 2021 |
| Should have a `model_type` key in its config.json | 0 | 1916 | September 20, 2021 |
| Using a fixed vocab.txt with AutoTokenizer? | 1 | 2304 | September 13, 2021 |
| Train wordpiece from scratch | 2 | 1436 | September 9, 2021 |
| I set up a different batch_size, but the time of data processing has not changed | 0 | 537 | September 1, 2021 |
| Cannot create an identical PretrainedTokenizerFast object from a Tokenizer created by tokenizers library | 1 | 1091 | August 30, 2021 |
| Index of wordpieces (subwords) after tokenization by transformers | 0 | 699 | August 28, 2021 |