One of the parquet files in a dataset I previously uploaded is now marked as unsafe (the mark appeared after I updated the README.md, but I don’t think the two are related).
How is this possible? Aren’t Parquet files just binary, column-oriented data files?
Every column here is a string; does this mean that one of the rows contains trojan code as a string? Maybe I crawled malicious code instead of a site’s content?
Or does this mean that my PC has a trojan in my pandas / pyarrow library that adds malicious code to the saved parquets? The other datasets I have don’t seem to have this issue (at least they aren’t marked at the moment), which makes it even weirder.
Do I just delete this single parquet file? Will the rest of the chunks still load as intended without it?
Which antivirus software should I switch to now (I scanned both the shard and the files used to generate it but neither ESET nor Windows Defender managed to find anything, and I couldn’t even Google this trojan)?
Are your Parquet files compressed? I think we just have a false positive here. It happens with Parquet files: some of the rules that match viruses are too general.
You can try uploading the file to https://www.virustotal.com/ and see what you get, but IMO you should be ok.
If your Parquet is compressed, you can try decompressing it and running an antivirus on the resulting file(s).
does this mean that my PC has a trojan in my pandas / pyarrow library that adds malicious code to the saved parquets
That sounds unlikely to me.
Do I just delete this single parquet file?
I’d say it’s ok to leave it for now. I’ll check what the virus signature looks like when I can and report it to the ClamAV team.
Which antivirus software should I switch to now
I don’t think you need to switch to something else, we use https://www.clamav.net/ but I reckon it’s more hands-on and not as practical for personal usage.
You can download it and run sigtool --find-sigs <YOUR_SIGNATURE> if you want more info on why it matched.
@mcpotato Should I reupload the dataset without it or just let it go? Is there a way to reupload that single parquet chunk alone or do I have to delete and recreate the whole dataset?