Tokenization: different results when tokenizing in one pass vs sample-by-sample

Anyone got an update on this?
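The thread doesn't include a repro, so as a point of reference, here is a minimal sketch of one common way this discrepancy shows up with Hugging Face tokenizers, assuming that's the library in question: batched tokenization with `padding=True` appends pad tokens to the shorter samples, so a naive comparison against per-sample output fails. The model name and sample texts below are placeholders, not taken from the thread.

```python
from transformers import AutoTokenizer

# Placeholder model; substitute the tokenizer actually in use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

samples = ["first example sentence", "a much longer second example sentence"]

# One pass: tokenize the whole list at once (padding aligns lengths).
batch_ids = tokenizer(samples, padding=True)["input_ids"]

# Sample-by-sample: tokenize each string on its own (no padding applied).
single_ids = [tokenizer(s)["input_ids"] for s in samples]

# With padding enabled, the batched output carries trailing pad tokens,
# so equality fails even though the non-pad token ids match.
for b, s in zip(batch_ids, single_ids):
    print(b == s, b, s)
```

If padding isn't the cause in your case, posting the tokenizer, settings, and a sample pair that diverges would help narrow it down.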