Pre-trained embedding model on API specification files for RAG use case

Hello, I have a set of files containing API specifications. Some of them are JSON/YAML files that follow the Swagger or OpenAPI spec, but most are either Markdown or multi-sheet Excel files.
The Markdown and Excel files don't follow a standard template, but they contain API specification info such as the path, method, request, response, return code, etc.
For technical documents like these, what would be a good pre-trained embedding model for returning precise results in an API catalog search / chat / "find APIs for this user story" kind of RAG use case?
Let me know if you need further details. Any recommendation or direction in this regard is appreciated.


For RAG applications, I think the first step is to pick an embedding model that scores well on the retrieval tasks of the MTEB leaderboard and try it out on your own documents. For that type of document, a straightforward, high-performance general-purpose model is probably a better fit than a specialized one. If multilingual support is required, a larger model might be a good idea.
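To make "try it out" concrete, here is a minimal sketch of how candidate models could be compared on a handful of your own queries using recall@k. Everything in it is an assumption for illustration: `recall_at_k`, `toy_embed`, and the sample data are hypothetical names, and `toy_embed` is only a hashed bag-of-words stand-in so the script runs without downloading anything. In practice you would pass in the embedding function of whichever model you are testing.

```python
import math
import zlib
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    docs: Dict[str, str],      # doc id -> text chunk
    queries: Dict[str, str],   # query text -> id of the one relevant doc
    k: int = 3,
) -> float:
    """Fraction of queries whose relevant doc appears in the top k results."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in queries.items():
        query_vec = embed(query)
        ranked = sorted(
            doc_vecs,
            key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
            reverse=True,
        )
        hits += relevant_id in ranked[:k]
    return hits / len(queries)


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Stand-in only (hashed bag of words), NOT a real embedding model.

    Swap this for the encode function of the model under test.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


if __name__ == "__main__":
    # Hypothetical sample data; use chunks and queries from your own catalog.
    docs = {
        "orders-get": "GET /orders list all orders for a customer",
        "users-post": "POST /users create a new user account",
    }
    queries = {"how do I list orders": "orders-get"}
    print(recall_at_k(toy_embed, docs, queries, k=1))
```

Running the same `docs` and `queries` through each candidate model's embedding function gives a like-for-like number, which is usually more informative for a specific document set than the leaderboard rank alone.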

I’ve worked on a similar project where we had to deal with inconsistent API documentation formats like Markdown and Excel sheets. Using a model like OpenAI’s embeddings, combined with some custom preprocessing to normalize the data before embedding, worked well. Also, for quick brainstorming or even casual coding breaks, I sometimes use Omegle to chat with strangers about tech topics. It’s surprisingly helpful to get fresh perspectives on tricky problems!
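As a rough idea of what that kind of normalization could look like (a sketch under my own assumptions, not the exact pipeline from that project): map whatever column headers the Markdown/Excel tables use onto one canonical set of fields, flatten OpenAPI specs into the same fields, and render every endpoint as one text chunk with the same layout before embedding it. The `ALIASES` table and all function names below are hypothetical, and the rows are assumed to be already parsed into plain dicts.

```python
from typing import Any, Dict, List

# Hypothetical header aliases; extend with the headers your own files use.
ALIASES = {
    "path": {"path", "endpoint", "url", "uri"},
    "method": {"method", "http method", "verb"},
    "summary": {"summary", "description", "purpose"},
    "request": {"request", "request body", "payload"},
    "response": {"response", "response body"},
    "return_code": {"return code", "return codes", "status", "status code"},
}
LOOKUP = {alias: field for field, names in ALIASES.items() for alias in names}

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

# Fixed field order so every chunk has the same shape.
FIELDS = [
    ("method", "Method"),
    ("path", "Path"),
    ("summary", "Summary"),
    ("request", "Request"),
    ("response", "Response"),
    ("return_code", "Return codes"),
]


def normalize_row(row: Dict[str, Any]) -> Dict[str, str]:
    """Map one table row (header -> cell) onto the canonical field names."""
    record: Dict[str, str] = {}
    for header, value in row.items():
        field = LOOKUP.get(str(header).strip().lower())
        text = "" if value is None else str(value).strip()
        if field and text:  # unknown headers and empty cells are dropped
            record[field] = text
    if "method" in record:
        record["method"] = record["method"].upper()
    return record


def openapi_to_records(spec: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten the `paths` object of a parsed OpenAPI/Swagger document."""
    records = []
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            codes = sorted(str(code) for code in op.get("responses", {}))
            records.append({
                "path": path,
                "method": method.upper(),
                "summary": op.get("summary", ""),
                "return_code": ", ".join(codes),
            })
    return records


def to_chunk(record: Dict[str, str]) -> str:
    """Render one endpoint as a uniform text chunk ready for embedding."""
    return "\n".join(
        f"{label}: {record[key]}" for key, label in FIELDS if record.get(key)
    )


if __name__ == "__main__":
    row = {"Endpoint": "/orders", "HTTP Method": "get", "Status Code": 200}
    print(to_chunk(normalize_row(row)))
```

Keeping one chunk per endpoint, with the same labels in the same order regardless of the source format, means the embedding model sees comparable text for every API, and the chunk can be returned as-is in the chat answer.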
