Is there a vision model for zero-shot clustering?

I have a dataset of images from The Simpsons. I have trained a face-detection model for Simpsons characters with good results, and after a lot of experimenting I have written a script that produces relatively accurate binary images of the faces. I will attach a screenshot as an example. I have also tried using cv2.findContours to extract the contours and treating them as matrices to compute the difference between two faces, but with no luck.

My end goal is to cluster these faces by character. I have tried some more basic ML algorithms without success, and I now think this task may be too complex for them.

Is there a vision model that would be well suited to this? Suggestions for other approaches would be great too.

Here is an example of my processed Simpsons faces that I want to cluster:

As a side note, I don’t care that much if, for example, images of Bart Simpson from a front angle end up in a different cluster from images of him from a side angle, as it will be easy enough to manually merge these clusters after the fact.