Hello everyone,
I’m relatively new to working with LLMs and I’ve run into a challenge that I’m hoping to get some help with. I’m working on a project that analyzes potentially malicious scripts: I read a file containing malicious code (e.g., a VBS script), submit its contents to a language model, and receive an analysis of the script’s behavior.
Here’s the issue I’m facing: when I submit the script content to the model, the generated response includes both the requested analysis and the submitted source code of the script. I only want the analysis, without the source code being repeated in the response.
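One thing I have already experimented with is wrapping my request in the [INST] … [/INST] instruction format that the Mistral-Instruct models were fine-tuned on, in case my plain-text prompt was causing the echoing. This is just a sketch of what I mean (`build_prompt` is my own helper name, not a library function):

```python
def build_prompt(instruction: str, script: str) -> str:
    """Wrap the request in Mistral's [INST] ... [/INST] instruction format."""
    return (
        f"[INST] {instruction}\n"
        "Return only your analysis; never repeat the script itself.\n\n"
        f"{script} [/INST]"
    )

prompt = build_prompt("Analyse this VBS script.", 'MsgBox "hello"')
```

It did not fully stop the echoing for me, but the responses seemed to follow the instruction more often than with a bare prompt.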
I am using the following libraries for my project: ctransformers, huggingface-hub, and langchain. Here’s a snippet of my current code:
from huggingface_hub import hf_hub_download
from langchain.llms import CTransformers

# Read the script to analyze
malware_path = 'Malwares/malware.vbs'
with open(malware_path, 'r', encoding='utf-8') as file:
    vbs_content = file.read()

model_repo = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
model_filename = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"

# This downloads the model file to a local path
model_file_path = hf_hub_download(repo_id=model_repo, filename=model_filename)
print(f"Model downloaded to: {model_file_path}")

config = {
    'max_new_tokens': 2000,
    'temperature': 0.7,
    'repetition_penalty': 1.1,
    'context_length': 4096,
    'stream': True,
    # Note: ctransformers runs GGUF models outside PyTorch, so GPU offload
    # goes through the 'gpu_layers' config option here, not torch.nn.DataParallel.
}
llm = CTransformers(model=model_file_path, model_type="mistral", config=config)

def split_into_chunks(text, chunk_size=2500):
    """Split the text into fixed-size chunks that fit the context window."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Preparing the prompt
question = ("Can you analyse the following malware without including it in your "
            "response? Provide a general description of its behavior and list "
            "any indicators of compromise (IOC).\n\n")

# Splitting the content into chunks and querying the model on each one
chunks = split_into_chunks(vbs_content)
responses = []
for chunk in chunks:
    prompt = f"{question}Analysis of the following content (do not include the content in the answer):\n{chunk}"
    responses.append(llm(prompt))

# Concatenating the per-chunk responses to get the complete analysis
separator = "\n### CHUNK END ###\n"
final_response = separator.join(responses)
print(final_response)
I have been searching for a way to either configure the model or post-process its output so that the submitted content (i.e., the malware source code) is excluded from the generated response, but I haven’t been able to find a solution on my own.
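The closest I have come is a post-processing step that drops any line of the model’s response that also appears verbatim in the submitted script, but it feels fragile since it misses reworded or partially quoted code. A minimal sketch of that filter (`strip_echoed_source` is my own helper name):

```python
def strip_echoed_source(response: str, source: str) -> str:
    """Drop response lines that appear verbatim in the submitted source."""
    source_lines = {line.strip() for line in source.splitlines() if line.strip()}
    kept = [line for line in response.splitlines()
            if line.strip() not in source_lines]
    return "\n".join(kept)
```

I would call this as `clean = strip_echoed_source(response, chunk)` inside the loop, but I suspect there is a better prompt-level or configuration-level approach.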
Could anyone share advice or suggestions on how to approach this issue? I’d greatly appreciate any guidance, especially since I’m a bit new to this field and might be missing something obvious.
Thank you very much for your help!