Turkish characters get corrupted when loading dataset via audiofolder

yilmazay · April 4, 2023, 8:38am

After a lot of debugging and analysis, I found the root cause of the problem.
It turned out to be an encoding problem in the input metadata.csv file.
It was not load_dataset()'s fault.
When I generate the csv, I convert all content to lowercase using the Turkish locale.
Interestingly, the Turkish uppercase İ (an uppercase I with a dot on top) was causing the problem. When İ is converted to i, a stray diacritic dot gets added to the next character.
That was really very strange.
Anyway, I think the problem will be solved after retraining from scratch.
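
For anyone hitting the same thing, here is a minimal sketch of how this kind of stray dot can show up. It assumes the csv was lowercased with something that behaves like Python's default `str.lower()`, which may not be what was used here. Under the default Unicode rules, İ (U+0130) lowercases to `i` followed by U+0307 (COMBINING DOT ABOVE), and that extra code point is what shows up as a misplaced dot. The helper `turkish_lower` is a hypothetical name for illustration, not part of any library.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish casing rules (hypothetical helper).

    Turkish has two distinct pairs: İ/i (dotted) and I/ı (dotless).
    """
    # Compose "I" + U+0307 into the single code point İ (U+0130) first,
    # so decomposed input is handled the same way as precomposed input.
    text = unicodedata.normalize("NFC", text)
    # Map the two Turkish-specific capitals by hand, then lowercase the rest.
    text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


# Default lowercasing keeps the dot as a separate combining character.
naive = "İSTANBUL".lower()
has_combining_dot = "\u0307" in naive  # True: one extra code point after "i"

# Turkish-aware lowercasing produces no combining characters.
fixed = turkish_lower("İSTANBUL IŞIK")
```

A quick way to check an existing metadata.csv is to search its text for `"\u0307"`. If it appears, regenerating the file with Turkish-aware lowercasing (or stripping that code point) should remove the stray dots.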
