Turkish characters get corrupted when loading dataset via audiofolder

yilmazay · April 3, 2023, 8:57am

Hi everyone,
I tried fine-tuning XLSR on the CommonVoice Turkish data.
It trained successfully and I can decode Turkish audio, although the accuracy is not very good.
The decoded Turkish strings are valid, without any non-Turkish characters.
Then I fine-tuned XLSR again with our own Turkish data, in the same way as described in a blog.
Fine-tuning completed without errors and I can decode Turkish audio with the resulting model, but the decoded text is very strange: almost every character carries a diacritic dot. Turkish does not use dots like these, so the decoded text looks abnormal, as in the following sample:
ė ̇ȯṅu̇ṅ ̇bu̇ ̇ṡë̇v̇ṁî̇rî̇â̇i̇ṅ ̇ȯṙṫȧḋȧn ̇k̇ȧl̇düṙü̇yi̇ṁḋėṙk̇ė ̇k̇ȧrȧḋȧ ̇k̇ȧl̇ḋü̇ṙmü̇ğ̇ ̇î̇l̇k̇yėṫtė ̇kȧl̇düṙṁü̇ğ̇ ̇i̇ṅṡȧṅ ̇
When I look at the loaded data, which is saved as Arrow files in the cache, it also appears corrupted, as seen below:
“her ne hikmetse tekrar canlÄ± yayÄ±n sesi aynÄ± ÅŸekilde devam ediyoristemiyorum tabii bÃ¶yle bir ÅŸey sÃ¶z konusu deÄŸil bÃ¶yle bir uygulama”. In addition, the Arrow files’ encoding is reported as ASCII (I was expecting UTF-8).
My input metadata.csv, on the other hand, looks fine and is encoded as UTF-8.
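One possible reading of the garbled sample above, offered as an assumption rather than a confirmed diagnosis: the text has the shape of UTF-8 bytes displayed through a single-byte code page such as Windows-1252. If so, the bytes in the Arrow file may be intact and only the viewer is misinterpreting them. A minimal Python sketch of that pattern, using two words from the sample:

```python
# Sketch: reproduce the mojibake pattern by decoding UTF-8 bytes as cp1252.
# Assumption: the cache file was viewed in a tool that defaulted to cp1252.
original = "canlı yayın"
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # canlÄ± yayÄ±n

# Reversing the mistake recovers the original text, so no data was lost.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == original)  # True
```

If the round trip recovers readable Turkish from a string copied out of the cache, the stored data is likely fine and the display tool is the thing to check.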
This led me to think that the data gets corrupted while the dataset is being loaded.
So it seems to me that the load_dataset() function has problems with UTF-8 encoded data.
I am loading the data as below:
saibld_train = load_dataset("audiofolder", data_dir=train_data_path)
where train_data_path is a local folder containing a UTF-8 encoded metadata.csv and the audio files.
My question is:
What should I do to make the load_dataset() function of datasets load non-ASCII UTF-8 text (Turkish characters) correctly, without corrupting it?
Any help or recommendation is very much appreciated.
Thanks in advance.
Yılmaz A.

yilmazay · April 4, 2023, 8:38am

After a lot of debugging and analysis, I found the root cause of the problem.
It turned out to be a problem with the input metadata.csv file’s encoding.
It was not load_dataset()'s fault.
While generating the CSV, I convert all content to lowercase using the Turkish locale.
Interestingly, the Turkish uppercase dotted I (İ, an uppercase I with a dot on top) was causing the problem. While converting İ to i, the conversion was adding a diacritic dot on the next character.
That was really very strange.
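The behaviour described here can be reproduced in Python, assuming the CSV was lowercased with something like str.lower() (the post does not say which tool was used). Python's str.lower() ignores the locale and maps İ (U+0130) to a plain i followed by U+0307, COMBINING DOT ABOVE, so the string grows by one code point:

```python
import unicodedata

word = "İstanbul"
lowered = word.lower()

# The lowercased string is one code point longer than the input.
print(len(word), len(lowered))  # 8 9

# The first two code points are "i" plus a separate combining dot.
names = [unicodedata.name(ch) for ch in lowered[:2]]
print(names)  # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```

If the combining dot ends up as its own entry in the tokenizer vocabulary, the model may learn to emit it as a separate symbol, which would be consistent with the dotted output shown in the first post.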
Anyway, I think the problem will be solved after a new training run from scratch.
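Before retraining, the transcripts could be regenerated with Turkish-aware lowercasing and then checked for stray combining marks. The following is a sketch only: turkish_lower and has_combining_marks are hypothetical helper names, not part of datasets or any other library, and it assumes the text is processed in Python.

```python
import unicodedata

def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules (hypothetical helper)."""
    # Compose any decomposed "I" + combining dot into a single İ first.
    text = unicodedata.normalize("NFC", text)
    # Handle the two Turkish-specific capitals before the generic step:
    # dotted İ becomes i, dotless I becomes ı.
    text = text.replace("İ", "i").replace("I", "ı")
    return text.lower()

def has_combining_marks(text: str) -> bool:
    """Return True if any code point in text is a combining mark."""
    return any(unicodedata.combining(ch) for ch in text)

sample = "İSTANBUL IŞIK"
print(turkish_lower(sample))                        # istanbul ışık
print(has_combining_marks(sample.lower()))          # True
print(has_combining_marks(turkish_lower(sample)))   # False
```

Running has_combining_marks over every transcription in metadata.csv before training would flag any row that still carries the stray dot.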
