Trojan in common_voice dataset?

kib-research · May 19, 2022, 2:36pm

Today, I downloaded the common_voice dataset, and Windows Defender claims to have found a Trojan named Oneeva.A!ml in it, classified as a severe threat. As the affected item, it pointed to a specific mp3 file in clips. I guess it was a false alarm, but I’d like to be sure. Does anyone know anything about this? Any help is much appreciated!

lhoestq · May 20, 2022, 10:57am

Hi! Where did you get this file?
Did you download the dataset using load_dataset?

kib-research · May 20, 2022, 3:01pm

Hi! Yes, I used load_dataset. According to Windows Defender, the affected item was common_voice_de_17677975.mp3. After removing the files, I ran a full system scan, and no (further) threats were detected. Thanks for asking!

lhoestq · May 20, 2022, 4:30pm

OK, thanks! We’re taking a look at it.

conceptofmind · May 21, 2022, 3:57pm

Hi @lhoestq ,

I have also noticed that in other datasets, such as the one used in the Code Parrot training blog, some files are flagged as unsafe (trojans, malware, spyware, etc.). They all appear to be non-executable, since they are stored in JSON format in the Arrow files. For example, this file is marked "Unsafe: Win.Trojan.MSShellcode-88".

Thank you,

Enrico

kib-research · May 22, 2022, 4:23pm

Hi, I forgot to mention that I downloaded the German common_voice dataset. Sorry about that!

lhoestq · May 24, 2022, 10:23am

We scanned common_voice_de_17677975.mp3 for malware using clamscan, and the file is OK.
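
For anyone who wants to repeat this kind of check locally, here is a minimal sketch, assuming ClamAV's `clamscan` is installed and on the PATH. The helper names `interpret_clamscan_exit` and `scan_file` are hypothetical, made up for illustration; the exit codes (0 = no virus found, 1 = virus found, 2 = an error occurred) are the ones described in the clamscan man page.

```python
import shutil
import subprocess


def interpret_clamscan_exit(code: int) -> str:
    """Map a clamscan exit code to a short label (hypothetical helper)."""
    if code == 0:
        return "clean"      # no virus found
    if code == 1:
        return "infected"   # at least one signature matched
    return "error"          # 2 (or anything else): the scan itself failed


def scan_file(path: str) -> str:
    """Scan one file with clamscan and return a label (hypothetical helper)."""
    # Assumption: clamscan is on the PATH; report clearly if it is not.
    if shutil.which("clamscan") is None:
        return "clamscan-not-installed"
    # --no-summary suppresses the statistics block printed after the scan.
    result = subprocess.run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
    )
    return interpret_clamscan_exit(result.returncode)
```

Calling `scan_file("common_voice_de_17677975.mp3")` from the directory that holds the clip would then return one of the labels above. A single scanner's verdict is only one data point, so a second opinion from another engine can help when two tools disagree.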

Regarding the Code Parrot dataset, several files do indeed contain malware source code (the dataset contains a large part of GitHub, so this is expected). But these are just text files, so you are fine as long as you don’t try to execute the code.
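
To illustrate the point about text files, here is a small sketch with a made-up record. The `content` field name, the sample string, and the `inspect_source` helper are all assumptions for illustration, not taken from the Code Parrot dataset. Reading or inspecting the string does not run it; it would only run if it were passed to something like `exec` or written to disk and launched.

```python
# A made-up record standing in for one row of a code dataset (assumption).
record = {"content": "import os\nos.system('echo this would run a shell command')"}


def inspect_source(text: str) -> dict:
    """Return simple statistics about a source string without executing it."""
    lines = text.splitlines()
    return {
        "num_lines": len(lines),
        "num_chars": len(text),
        "mentions_os_system": "os.system" in text,
    }


# The code in the record is handled purely as data: nothing is executed here.
stats = inspect_source(record["content"])
print(stats)
```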

kib-research · May 25, 2022, 7:42am

Thanks a lot for looking into it, that’s reassuring!

conceptofmind · June 30, 2022, 10:34pm

Yes, you would hope that someone would not extract and execute those code segments.

Thank you for the update.

Enrico
