BPEtokenizer reports error "not valid UTF-8" when processing txt file

Tommy-Zhao · January 28, 2025, 2:28pm

The code and error are as follows. How can I fix it? Any suggestions are welcome!

John6666 · January 28, 2025, 3:11pm

This issue?

github.com/huggingface/tokenizers

Exception: stream did not contain valid UTF-8

opened 08:54AM - 28 May 20 UTC

closed 04:29PM - 29 Jun 20 UTC

phamdinhkhanh

I get a bug when training ByteLevelBPETokenizer() on a language with diacritics in utf-16, such as the 'Viet Nam' language. Below is my code to initialize the tokenizer.

```
%%time
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = ['file1.txt', 'file2.txt']
print(paths)

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
```

And the bug log:

> <ipython-input-78-66e6ec31bd7b> in train(self, files, vocab_size, min_frequency, show_progress, special_tokens)
> 90 files = [files]
> 91 print('files list: \n', files)
> ---> 92 self._tokenizer.train(trainer, files)
>
> Exception: stream did not contain valid UTF-8

My `file1.txt` and `file2.txt` contain words like: `xin chào tôi đến từ Việt Nam, tôi gặp vấn đề với tokenizer.` I tried to find out what self._tokenizer.train() does so I could fix it myself, but the project code is complicated. Can you explain what I did wrong?
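If the training files really are UTF-16, as the issue above describes, one option is to re-encode them to UTF-8 before passing them to the trainer. This is only a minimal sketch: `reencode_to_utf8` is a hypothetical helper name, and the `utf-16` source encoding is an assumption that has to match the actual files.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="utf-16"):
    """Read `src` with `source_encoding` and write the same text to `dst` as UTF-8.

    `source_encoding` is an assumption; a wrong value raises UnicodeDecodeError
    or silently produces garbage, so check the result before training.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


# Small demonstration with a throwaway UTF-16 file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "file1_utf16.txt"
    dst = Path(tmp) / "file1_utf8.txt"
    sample = "xin ch\u00e0o"  # "xin chào", written with an escape for clarity
    src.write_text(sample, encoding="utf-16")

    reencode_to_utf8(src, dst)
    roundtrip = dst.read_bytes().decode("utf-8")

print(roundtrip == sample)
```

The converted files, not the originals, would then go into `paths`.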

Tommy-Zhao · January 28, 2025, 5:01pm

I read this thread, but it didn’t solve the problem.
I downloaded the target file, opened it with Notepad, and found that the encoding was already UTF-8.

I ran this program on Colab and everything works, but when I run it in Jupyter Notebook, it reports the error “not contain valid UTF-8”.
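Opening one file in Notepad only confirms that file. A quick way to check every path that is handed to the trainer is to try decoding each one as UTF-8. The sketch below uses only the standard library; `find_non_utf8` is a hypothetical helper name, not part of tokenizers.

```python
import tempfile
from pathlib import Path


def find_non_utf8(paths):
    """Return (path, byte_offset) for every file that fails to decode as UTF-8."""
    bad = []
    for p in paths:
        try:
            Path(p).read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte
            bad.append((str(p), err.start))
    return bad


# Demonstration: one valid UTF-8 file and one UTF-16 file.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt"
    bad = Path(tmp) / "bad.txt"
    good.write_text("xin ch\u00e0o", encoding="utf-8")
    bad.write_text("xin ch\u00e0o", encoding="utf-16")  # starts with a BOM

    report = find_non_utf8([good, bad])

print(report)
```

Running this on the exact `paths` list used in the failing notebook shows which file, if any, the trainer cannot read.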

John6666 · January 28, 2025, 5:34pm

If it doesn’t work in Colab but does work in Jupyter, I can understand that, but the opposite… That’s rare…
In any case, I think that’s not a Python error, but an error in Rust or something. It’s a rare case where the library version is wrong or the library core is unable to process something and is throwing an error.

github.com/PacktPublishing/Transformers-for-Natural-Language-Processing

"Exception: stream did not contain valid UTF-8" request solution

opened 08:54AM - 09 Jun 21 UTC

closed 12:04PM - 05 Jul 21 UTC

hsupeter

Hi guys, I ran the Ch3 KantaiBERT.ipynb instruction `tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])` in step 3: Training a Tokenizer, and got the error "Exception: stream did not contain valid UTF-8". I searched the web for a solution; some people have hit a similar problem, but it wasn't solved. Can anyone tell me how to resolve it? I skipped the failing cell and ran the following cells, and they execute normally. So, are the outcomes of the following cells correct? Thanks.

Tommy-Zhao · January 29, 2025, 3:56am

Thanks for your reply
I've attached the tokenizer versions from Colab and Jupyter; they are the same.
Most likely it’s what you think
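For comparing the two environments, the installed versions can be printed in each notebook with the standard library alone. This is a small sketch; `installed_version` is a hypothetical helper, and the package names listed are just the ones relevant to this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run the same cell in Colab and in Jupyter, then compare the output.
for name in ("tokenizers", "transformers"):
    print(name, installed_version(name))
```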

John6666 · January 29, 2025, 5:17am

There was a similar issue reported against the library itself. I don’t know if this is it. If the data being handled is exactly the same, then this probably isn’t it…

github.com/huggingface/tokenizers

Access utf-8 byte sequence for each token

opened 12:29PM - 09 Sep 24 UTC

DanielHesslow

Hi, it would be great if it was possible to get the utf-8 byte sequence corresponding to each token id. Since tokenizers return strings, tokens which are not valid unicode strings by themselves will contain � on decode. This, e.g., makes streaming and constrained generation much more difficult and error prone than it needs to be. Additionally, if we can get the utf-8 byte sequence, decoding also gets much easier and faster, as it's simply a matter of concatenating the corresponding bytes. Cheers,
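The effect described in that issue can be reproduced with plain Python bytes. The sketch below does not use the tokenizers API; it only assumes a multi-byte character gets split across two byte chunks, which is the situation the issue describes for byte-level tokens.

```python
text = "ch\u00e0o"  # "chào"; U+00E0 takes two bytes in UTF-8
data = text.encode("utf-8")

# Split in the middle of the two-byte character, as a byte-level
# token boundary might.
pieces = [data[:3], data[3:]]

# Decoding each piece on its own yields U+FFFD replacement characters...
decoded_separately = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# ...while concatenating the bytes first recovers the original text.
decoded_joined = b"".join(pieces).decode("utf-8")

print(repr(decoded_separately))
print(repr(decoded_joined))
```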

Tommy-Zhao · January 29, 2025, 7:46am

Thanks for sharing the Rust encoding/decoding knowledge.
The issue has been solved. The Path variable was the source of the problem.
In Colab, only the single txt document fetched online is picked up, but in Jupyter Notebook, many txt files in unrelated directories are picked up as well.
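A sketch of that difference, assuming the file list was built with a recursive glob over the current directory (the notebook's actual code is not shown in this thread). `collect_training_files` and the directory names are hypothetical.

```python
import tempfile
from pathlib import Path


def collect_training_files(data_dir, pattern="*.txt"):
    """Return sorted paths matching `pattern` directly inside `data_dir` (non-recursive)."""
    return sorted(str(p) for p in Path(data_dir).glob(pattern) if p.is_file())


# Demonstration: a working directory with one intended corpus file and
# one unrelated, non-UTF-8 .txt file in a different folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "corpus.txt").write_text("training text", encoding="utf-8")
    (root / "other").mkdir()
    (root / "other" / "notes.txt").write_bytes(b"\xff\xfe not utf-8")

    # A recursive glob from the working directory picks up both files.
    recursive = sorted(p.name for p in root.glob("**/*.txt"))

    # Scoping the search to the data directory picks up only the corpus.
    scoped = [Path(p).name for p in collect_training_files(root / "data")]

print(recursive)
print(scoped)
```

On a fresh Colab runtime the working directory holds little besides the downloaded file, so a broad glob happens to work; on a local Jupyter install it can sweep in any `.txt` file below the notebook's folder.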

system · January 29, 2025, 7:47pm

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.
