| Topic | Replies | Views | Activity |
| --- | --- | --- | --- |
| Train Retry Tokenizer | 0 | 224 | April 18, 2023 |
| Pretokenise on punctuation except hyphens | 0 | 294 | April 15, 2023 |
| Tokenizer Trainer Crashing | 0 | 742 | April 15, 2023 |
| Tokenizer extremely slow when deployed to a container | 0 | 1296 | April 14, 2023 |
| Dealing with Decimal and Fractions | 1 | 1576 | October 27, 2022 |
| `add_tokens` with argument `special_tokens=True` vs `add_special_tokens` | 0 | 367 | April 5, 2023 |
| Unable to upload custom Pytorch model in huggingface | 0 | 375 | April 4, 2023 |
| RuntimeError: Cannot re-initialize CUDA in forked subprocess | 2 | 3165 | April 3, 2023 |
| Overflowing Tokens in MarkupLM | 0 | 444 | March 31, 2023 |
| I get the predicted token as ` े` . What am I doing wrong? | 1 | 616 | March 27, 2023 |
| <unk> token in the output instead curly braces | 0 | 501 | March 25, 2023 |
| How to add a new token without expanding the vocabulary | 0 | 782 | March 24, 2023 |
| Does the ByteLevelBPETokenizer need to be wrapped in a normal Tokenizer? | 0 | 1851 | March 18, 2023 |
| What is required to create a fast tokenizer? For example for a Marian model | 0 | 317 | March 16, 2023 |
| GPT2Tokenizer.decode maps unicode sequences to the same string '�' | 3 | 1207 | March 15, 2023 |
| Issue with Tokenizer | 0 | 682 | March 14, 2023 |
| Tokenizing my novel for GPT model | 0 | 847 | March 10, 2023 |
| How to add additional custom pre-tokenization processing? | 6 | 5233 | March 7, 2023 |
| Customize FlauBERT tokenizer to split line breaks | 0 | 272 | March 4, 2023 |
| How to change the size of model_max_length? | 0 | 955 | March 3, 2023 |
| Can't get to the source code of `tokenizer.convert_tokens_to_string` | 0 | 342 | February 28, 2023 |
| Why I'm getting same result with or without using Wav2Vec2Processor? | 0 | 332 | February 25, 2023 |
| How does `tokenizer().input_ids` work and how different it is from tokenizer.encode() before `model.generate()` and decoding step? | 1 | 2987 | February 22, 2023 |
| What file type should my training data be? | 0 | 295 | February 20, 2023 |
| Best way to get the closest token indices of input of char_to_token is a whitespace | 0 | 1001 | February 19, 2023 |
| Token indices sequence length is longer than the specified maximum sequence length | 4 | 23460 | February 15, 2023 |
| Create a simple tokenizer | 0 | 423 | February 14, 2023 |
| Sliding window for Long Documents | 1 | 2104 | February 9, 2023 |
| Creating tokenizer from counts file? | 0 | 218 | February 9, 2023 |
| Tokenizer.train() running out of memory | 0 | 764 | February 9, 2023 |