Today I downloaded the common_voice dataset, and Windows Defender claims to have found a Trojan named Oneeva.A!ml in it, classified as a severe threat. The affected item it listed was a specific MP3 file in clips. I guess it was a false alarm, but I'd like to be sure. Does anyone know anything about this? Any help is much appreciated!
Hi! Where did you get this file from?
Did you download the dataset using load_dataset?
Hi! Yes, I used load_dataset. According to Windows Defender, the affected item was common_voice_de_17677975.mp3. After removing the files, I ran a full system scan, which detected no further threats. Thanks for asking!
OK, thanks! We're taking a look at it.
Hi @lhoestq,
I have also noticed that in other datasets, such as the one used in the Code Parrot training blog, some files are flagged as unsafe (trojans, malware, spyware, etc.). They all appear to be non-executable, since they are stored in JSON format in the Arrow files. For example, this file is marked as unsafe: Win.Trojan.MSShellcode-88.
Thank you,
Enrico
Hi, I forgot to mention that I downloaded the German common_voice dataset. Sorry about that!
We scanned common_voice_de_17677975.mp3 for malware using clamscan, and the file is OK.
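For anyone who wants to repeat this kind of check locally, here is a minimal sketch that calls clamscan from Python. It assumes ClamAV is installed and clamscan is on your PATH; the helper names `interpret_exit_code` and `scan_file` are made up for illustration and are not part of any library. It relies on the exit codes documented in the clamscan man page (0 = no virus found, 1 = virus found, 2 = error).

```python
import shutil
import subprocess

# Exit codes as documented in the clamscan man page.
CLAMSCAN_RESULTS = {0: "clean", 1: "infected", 2: "error"}


def interpret_exit_code(code: int) -> str:
    """Map a clamscan exit code to a short label (hypothetical helper)."""
    return CLAMSCAN_RESULTS.get(code, "unknown")


def scan_file(path: str) -> str:
    """Scan a single file with clamscan and return a short label.

    Assumes ClamAV is installed; raises FileNotFoundError otherwise.
    """
    if shutil.which("clamscan") is None:
        raise FileNotFoundError("clamscan not found on PATH")
    completed = subprocess.run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is a result here, not a crash
    )
    return interpret_exit_code(completed.returncode)
```

You would call it as `scan_file("common_voice_de_17677975.mp3")`, adjusting the path to wherever the clip sits in your local cache. A "clean" result from one scanner is not proof of safety, but combined with the Defender-only detection it points towards a false positive.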
Regarding the Code Parrot dataset, several files do indeed contain malware source code (the dataset contains a large part of GitHub, so this is expected). But these are just text files, so you are fine as long as you don't try to execute the code.
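To illustrate why stored source code is inert, here is a small sketch. The record, the `content` field name, and the snippet inside it are all made up for this example and are not taken from the actual dataset. Parsing a JSON record only produces a Python string; nothing in it runs unless you explicitly pass it to something like `exec` or write it to disk and launch it.

```python
import json

# Hypothetical record: the "content" field name and the code inside it
# are invented for illustration only.
raw_line = json.dumps({"content": "import os\nprint('hello')"})

record = json.loads(raw_line)
source_text = record["content"]

# json.loads only builds Python objects; the string is never evaluated.
print(type(source_text).__name__)
print(source_text.splitlines()[0])
```

An antivirus scanner can still match byte patterns inside such strings, which is likely why the Arrow files get flagged even though they contain no executables.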
Thanks a lot for looking into it. That's reassuring!
Yes, you would hope that someone would not extract and execute those code segments.
Thank you for the update.
Enrico