Trojan in common_voice dataset?

Today, I downloaded the common_voice dataset, and Windows Defender claims to have found a Trojan named Oneeva.A!ml in it, classified as a severe threat. As the affected item, it listed a specific MP3 file in clips. I guess it was a false alarm, but I'd like to be sure. Does anyone know anything about this? Any help is much appreciated!

Hi! Where did you get this file from?
Did you download the dataset using load_dataset?

Hi! Yes, I used load_dataset. According to Windows Defender, the affected item was common_voice_de_17677975.mp3. After removing the files, I ran a full system scan, and no further threats were detected. Thanks for asking!

OK, thanks! We're taking a look at it :wink:

Hi @lhoestq,

I have also noticed that other datasets, such as the one used in the Code Parrot training blog, contain files that are flagged as unsafe when scanned: trojans, malware, spyware, etc. It looks as if they are all non-executable, since they are stored in JSON format in the Arrow files. For example, one such file is flagged as Unsafe: Win.Trojan.MSShellcode-88.

Thank you,

Enrico

Hi, I forgot to mention that I downloaded the German common_voice dataset. Sorry about that!

We scanned common_voice_de_17677975.mp3 for malware using clamscan and the file is OK :wink:
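
For anyone who wants to repeat this kind of check locally, here is a minimal sketch. It assumes ClamAV's clamscan is installed and on your PATH; the helper names `interpret_exit_code` and `scan_file` are made up for illustration and are not part of any library. It relies on the exit codes documented in the clamscan man page (0 = no virus found, 1 = virus found, 2 = error).

```python
import subprocess

# Exit codes as documented in the clamscan man page.
CLAMSCAN_VERDICTS = {0: "clean", 1: "infected", 2: "error"}


def interpret_exit_code(code: int) -> str:
    """Map a clamscan exit code to a human-readable verdict."""
    return CLAMSCAN_VERDICTS.get(code, "unknown")


def scan_file(path: str) -> str:
    """Run clamscan on a single file and return the verdict.

    Assumes the clamscan binary is installed and reachable via PATH.
    """
    result = subprocess.run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(result.returncode)
```

You would call it as `scan_file("common_voice_de_17677975.mp3")`, using whatever path the file has in your local cache. A "clean" result from a second scanner does not prove the first alert was a false positive, but it is a useful cross-check.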

Regarding the Code Parrot dataset, several files do indeed contain malware source code (the dataset contains a large part of GitHub, so this is expected). But these are just text files, so you are fine as long as you don't try to execute the code :wink:
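
To illustrate the point about text being inert, here is a small sketch with a made-up record (the field names `path` and `content` are assumptions for the example, not the actual dataset schema). Parsing the record only produces an ordinary string; nothing in it runs unless you explicitly execute it.

```python
import json

# Hypothetical record, invented for illustration; real dataset rows may look different.
raw = '{"path": "example.py", "content": "import os\\nos.getcwd()"}'

record = json.loads(raw)
code_text = record["content"]

# The embedded source code comes back as plain text and is never executed.
print(type(code_text).__name__)
print(code_text.splitlines())
```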

Thanks a lot for looking into it. That's reassuring!

Yes, you would hope that someone would not extract and execute those code segments. :sweat_smile:

Thank you for the update.

Enrico