One of my datasets was marked unsafe

Hi,

one of the parquet files in a dataset I previously uploaded is now marked as unsafe (the mark appeared after I updated the README.md, but I don’t think the two are related).

How is this possible? Aren’t parquet files just binary column ordered databases?
Each column is a string here, so does this mean that one of the rows contains trojan code as a string? Maybe I crawled malicious code instead of a site’s content?
Or does this mean that my PC has a trojan in my pandas / pyarrow library that adds malicious code to the saved parquets? The other datasets I have don’t seem to have this issue (at least they aren’t marked at the moment) which makes it even more weird.

Do I just delete this single parquet file? Will the rest of the chunks still load as intended without it?
Which antivirus software should I switch to now (I scanned both the shard and the files used to generate it but neither ESET nor Windows Defender managed to find anything, and I couldn’t even Google this trojan)?

Thanks for the help in advance!

Hello,

Are your Parquet files compressed? I think we just have a false positive here. It happens with Parquet files, and some of the rules that match viruses are sometimes too general.
You can try uploading the file to https://www.virustotal.com/ and see what you get, but IMO you should be ok.

If your Parquet file is compressed, you can try decompressing it and running an antivirus on the resulting file(s).

does this mean that my PC has a trojan in my pandas / pyarrow library that adds malicious code to the saved parquets

That sounds unlikely to me.

Do I just delete this single parquet file?

I’d say it’s ok to leave it for now, I’ll check what the virus signature looks like when I can and report to the ClamAV team.

Which antivirus software should I switch to now

I don’t think you need to switch to something else, we use https://www.clamav.net/ but I reckon it’s more hands-on and not as practical for personal usage.
You can download it and run sigtool --find-sigs <YOUR_SIGNATURE> if you want more info on why it matched.


Thank you for the detailed, in-depth explanation!

No worries.

This is the decoded signature:

$> sigtool --find-sigs Win.Trojan.Javel-1 | sigtool --decode-sigs
VIRUS NAME: Win.Trojan.Javel-1
TARGET TYPE: ANY FILE
OFFSET: *
DECODED SIGNATURE:
KREATIVITY FOR KATS

My guess is that you have the string KREATIVITY FOR KATS in your file somewhere, triggering the AV.
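For anyone who wants to check this locally without installing ClamAV: since the rule has TARGET TYPE: ANY FILE and OFFSET: *, a plain byte search over the raw file should behave roughly the same for this one signature. Below is a minimal sketch in Python, assuming the string is stored as literal bytes in the file (which may not hold if the column's compression splits it up). The helper name `find_signature_offsets` and the file name in the usage note are hypothetical.

```python
# Decoded body of Win.Trojan.Javel-1, taken from the sigtool output above.
SIGNATURE = b"KREATIVITY FOR KATS"


def find_signature_offsets(path, signature=SIGNATURE, chunk_size=1 << 20):
    """Return every byte offset at which `signature` occurs in the file at `path`.

    The file is read in chunks, and the last len(signature) - 1 bytes of each
    window are carried over so that matches straddling a chunk boundary are
    still found.
    """
    if not signature:
        raise ValueError("signature must not be empty")
    offsets = []
    overlap = len(signature) - 1
    tail = b""
    position = 0  # file offset of the first byte of the current window
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            start = 0
            while True:
                index = window.find(signature, start)
                if index == -1:
                    break
                offsets.append(position + index)
                start = index + 1
            # Keep just enough bytes to catch a match split across chunks.
            tail = window[-overlap:] if overlap else b""
            position += len(window) - len(tail)
    return offsets
```

Calling `find_signature_offsets("shard.parquet")` on the flagged shard (the file name is a placeholder) returns a list of offsets; an empty list would suggest the match comes from somewhere other than the literal string.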

You may have scraped a virus in that case. I’ll let you check and get back to me.


It should be this ebook, which was scraped as plain text (no script tags are included, but it doesn’t have any anyway):

https://www.gutenberg.org/files/51493/51493-h/51493-h.htm

The ebook is at index 479 and, after a quick inspection, contains only running text. Its metadata:

{'language': 'en',
 'text_id': 51493,
 'title': 'Kreativity For Kats',
 'issued': '2016-03-18 00:00:00',
 'authors': 'Leiber, Fritz, 1910-1992; Francis, Dick [Illustrator]',
 'subjects': 'Short stories; Cats – Fiction',
 'locc': 'PS',
 'bookshelves': 'Science Fiction'}
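For anyone hitting the same flag, a row like this can be located and dropped before the shard is saved again. This is only a minimal sketch in plain Python over a list of row dicts (with pandas, something like `df.to_dict("records")` would give the same shape). The `"text"` column name, the helper `drop_rows_containing`, and the stand-in rows are assumptions for illustration, not the dataset's real schema or content.

```python
def drop_rows_containing(rows, needle, column="text"):
    """Split `rows` into (kept, dropped) by whether `column` contains `needle`."""
    kept, dropped = [], []
    for row in rows:
        value = row.get(column) or ""  # treat a missing or None value as empty
        (dropped if needle in value else kept).append(row)
    return kept, dropped


# Made-up stand-in rows; only text_id 51493 comes from the metadata above.
rows = [
    {"text_id": 1, "text": "placeholder body of some other book"},
    {"text_id": 51493, "text": "KREATIVITY FOR KATS placeholder body"},
]
kept, dropped = drop_rows_containing(rows, "KREATIVITY FOR KATS")
print([row["text_id"] for row in dropped])
```

The match is case-sensitive on purpose, since the decoded signature is all upper case while the title in the metadata is not.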

@mcpotato Should I reupload the dataset without it or just let it go? Is there a way to reupload that single parquet chunk alone or do I have to delete and recreate the whole dataset?

Thanks in advance!

Never mind, I just took out that single file and reuploaded the whole thing. Thanks for taking the time to find the signature!
