Hi everyone,
I have about a hundred documents describing different courses, which I’ve transformed into embeddings. Based on a user’s input, I want to recommend what they could study next.
Example: “I want to study AI.”
I’m using ChromaDB for similarity search together with OpenAI’s text-embedding-3-large model. ChromaDB returns cosine distances in the range 0 to 2. In my case, the best match has a distance of around 0.8, which seems relatively high, but it’s still the best among the available options.
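For reference, this is roughly what my setup looks like (simplified; the API key, collection name, paths, and course data are placeholders, not my real values):

```python
import chromadb
from chromadb.utils import embedding_functions

# Same OpenAI embedding model for both documents and queries
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",                   # placeholder
    model_name="text-embedding-3-large",
)

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(
    name="courses",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"},  # cosine distance, range 0..2
)

# In reality ~100 course descriptions; two dummy entries here
courses = [
    {"id": "ml-101", "description": "Introduction to machine learning ..."},
    {"id": "db-201", "description": "Relational databases and SQL ..."},
]
collection.add(
    ids=[c["id"] for c in courses],
    documents=[c["description"] for c in courses],
)

# The user's input is the query
results = collection.query(query_texts=["I want to study AI"], n_results=5)
print(results["distances"][0])  # best match is around 0.8 for me
```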
The distance is high (and looks “bad”) because the user input is much shorter than the course descriptions. That length mismatch will always be there and may even grow over time. In my opinion, the absolute distance doesn’t matter much: all distances may look bad, but the ranking still shows which courses fit best, and the top result is still the best recommendation.
Since the embeddings come directly from the OpenAI model, I don’t think there’s much I can change about how they’re generated.
At the moment, my plan is simply to list the courses sorted by distance, regardless of how “bad” the distances are. This feels almost too simple.
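Concretely, the “just sort by distance” plan is nothing more than this (sketch; `collection` comes from the setup above and the user input is an example):

```python
user_input = "I want to study AI"

# ChromaDB already returns results ordered by ascending distance,
# so listing them in that order is all I do right now.
results = collection.query(query_texts=[user_input], n_results=10)

for course_id, distance in zip(results["ids"][0], results["distances"][0]):
    print(f"{course_id}  (cosine distance: {distance:.3f})")
```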
Is there something else I could be doing to improve the quality of these recommendations? I don’t want to apply a hard threshold, since even larger distances can still be useful.
Any advice or best practices for handling this kind of situation would be greatly appreciated.