I have recently started working on an AI app to detect vulnerabilities in C/C++ source code. My understanding is that for training and validation I need to feed the model both safe and unsafe code examples. The problem is that I cannot find a dataset that clearly separates the two: the ones I have found either contain only unsafe examples, or ship as a single file (pkl or json) with safe and unsafe examples merged together.
I expected to find datasets laid out with, for example, one directory (or file) containing only safe code and another containing only unsafe code.
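As a fallback, I assume a merged file could be split by its label field. Below is a minimal sketch of that idea, not tied to any particular dataset: it assumes the json file is an array of records, each with a `code` string and a `label` that is 0 for safe and 1 for unsafe. The key names, the label convention, and the output directory names are all my assumptions and would need adjusting to the real schema.

```python
import json
from pathlib import Path


def split_by_label(records, label_key="label", code_key="code"):
    """Partition records into (safe, unsafe) lists of source strings.

    Assumption: label 0 means safe, label 1 means unsafe. Both key
    names are hypothetical and depend on the dataset's schema.
    """
    safe, unsafe = [], []
    for rec in records:
        target = unsafe if int(rec[label_key]) == 1 else safe
        target.append(rec[code_key])
    return safe, unsafe


def write_samples(samples, out_dir, ext=".c"):
    """Write each sample to its own numbered file under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, src in enumerate(samples):
        (out / f"{i:06d}{ext}").write_text(src, encoding="utf-8")


def split_file(json_path, out_root):
    """Load a merged json file and write safe/ and unsafe/ directories."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)  # assumed: a list of dicts
    safe, unsafe = split_by_label(records)
    write_samples(safe, Path(out_root) / "safe")
    write_samples(unsafe, Path(out_root) / "unsafe")
    return len(safe), len(unsafe)
```

Calling `split_file("dataset.json", "data")` (both paths are placeholders) would then produce `data/safe/` and `data/unsafe/`, but I would still prefer a dataset that already comes separated.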
Any help here would be appreciated.