High-Level Philosophy Can Break/Jailbreak LLMs Real Bad

I used this technique to jailbreak one of the largest commercial LLMs, and I got further than anyone else I've ever seen. Afterwards, I made a personal decision that replicating it would be highly immoral, so I have not done so since.

I have gotten curious since, though. What happens if I simply talk to an LLM about the general topics surrounding the technique? It seems to do the same thing to a simulated brain that it does to actual PhD students' brains: I can see the LLM start to get 'stuck' on the concepts. From there, I can see that I could jailbreak it into oblivion with only one or two more questions if I continued the train of thought.

I do not want to discuss it in much depth because:

  1. It's super easy to do and more effective than I would have ever imagined.
  2. I have zero idea how I would actually patch it out, fully or even partially. I don't think the owners of the large commercial LLM I jailbroke know exactly how to either; I have simply decided, on ethical grounds, never to do it again.

Patching vulnerabilities in ML models and giving them defensive tools is not something I have learned yet in my studies. I would be more than willing to share what I know about this particular vulnerability in exchange for knowledge of how I might go about attempting to patch out such vulnerabilities.