Experience with and extending LLMs for software engineering

Hi from a newbie to this exciting forum :slight_smile:

I have been using various models to supercharge code generation and learning. Here, in order of preference and usefulness from my experience (FWIW), are the latest models of:
claude.ai
ChatGPT
perplexity.ai
GitHub Copilot - either in PyCharm :upside_down_face: or Neovim

I am looking to go from requesting snippets of code and methods/functions via shortish prompts to a more ambitious next step: passing more extensive parts of a software project to the LLM and requesting useful methods/functions.

The code I want to pass is a series of Pydantic BaseModel classes, together with example objects, definitions, and explanatory text. I want to generate compliant code for the key methods that operate on these models.
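
For concreteness, here is a minimal sketch of the kind of prompt I have in mind. The `Order` model and the `apply_discount` request are made up purely for illustration:

```python
import inspect

from pydantic import BaseModel


# Hypothetical model, standing in for our real Pydantic v2 models
class Order(BaseModel):
    order_id: int
    items: list[str]
    total: float


# Bundle the model source, an example instance, and the request into one prompt
model_source = inspect.getsource(Order)  # works when the class lives in a file
example = Order(order_id=1, items=["widget"], total=9.99)

prompt = f"""Given this Pydantic v2 model:

{model_source}

Example instance (JSON): {example.model_dump_json()}

Write a function `apply_discount(order: Order, pct: float) -> Order` that
returns a new Order with `total` reduced by `pct` percent.
"""
```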

I am writing before actually giving this a whirl. I know, of course, that there are limits on the number of tokens one can pass to and receive from an LLM.

So, I am wondering if there are any resources emerging, ideally in this great HF ecosystem, that are suited to this scope.

Any advice on how to get to the next (wo)man-machine level is much appreciated.

Thanks to and for :hugs:

E

Hi @Allom

Great to see your journey in code generation :slight_smile:! For passing extensive code to LLMs:

  1. Chunking: Break code into smaller parts (see the sketch after this list).
  2. Fine-tuning: Customize a model for your needs.
  3. External Memory: Use vector databases.
  4. Documentation: Provide clear examples.
  5. Hybrid Approaches: Combine tools and models.
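
For (1), here is one rough way to chunk Python source along top-level definitions so that each piece fits a context window. This is just a sketch, not a library API; a real pipeline would count tokens with the model's tokenizer rather than characters:

```python
import ast


def chunk_python_source(source: str, max_chars: int = 4000) -> list[str]:
    """Split a module into top-level defs/classes, packing small ones together.

    max_chars is a crude stand-in for a real token budget; a single
    oversized definition still becomes one chunk and would need further splitting.
    """
    tree = ast.parse(source)
    segments = [ast.get_source_segment(source, node) or "" for node in tree.body]

    chunks: list[str] = []
    current = ""
    for seg in segments:
        if current and len(current) + len(seg) > max_chars:
            chunks.append(current)
            current = ""
        current += seg + "\n\n"
    if current:
        chunks.append(current)
    return chunks
```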

Check out Hugging Face for resources like model cards and datasets. Have you explored specific models here?

Thank you @LLUMOAI for that kickstart. To the extent that I can get my head around it, it makes a lot of sense.

I can see vector databases as useful if you have some standard content that you want to be persistently available. Or if your chunks are really massive… (But that is just newbie thinking).

In my case it's just several thousand lines of Python.
We use Pydantic v2 models extensively and use an LLM to generate methods that act on instances of these models. As you say, we provide clear examples and documentation to the LLM.
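
At that scale I suspect a full vector database is overkill; a rough in-memory retrieval sketch (assuming `pip install sentence-transformers`; the embedding model name is just one common choice) might be enough:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # one choice among many


def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k code chunks most similar to the query."""
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec  # cosine similarity, vectors are normalized
    best = np.argsort(scores)[::-1][:k]
    return [chunks[int(i)] for i in best]
```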

I know nothing of the model cards and datasets available from Hugging Face and would love to learn more about how these could help to generate high-quality code to specification.

Many thanks for your support and help and super excited about learning more.
Eric

Hi @Allom

Glad to hear it helped! For learning about model cards and datasets on Hugging Face, explore their model hub and dataset section for detailed info and resources. Happy to assist further!
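
For example, pulling a dataset from the Hub takes a couple of lines (assuming `pip install datasets`; the dataset id below is a placeholder, so browse the datasets section for real options):

```python
from datasets import load_dataset

# "someuser/python-code-dataset" is a placeholder, not a real dataset id
ds = load_dataset("someuser/python-code-dataset", split="train")
print(ds[0])  # inspect one record to see which fields hold the code
```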

Hi @LLUMOAI, hi all

We are now closing in on the task of choosing a good model and then training and fine-tuning it with datasets.

The objective is to generate the code for methods that operate on our defined Python Pydantic and Enum class instances.

Of course we want to use models and datasets that are well tested and suited to this task; models that allow us to train and fine-tune with our own code (mainly the model definitions themselves) and relevant projects; and, of course, any Hugging Face datasets considered useful.

Ideally the model would be convenient to deploy and train on the Hugging Face platform (or locally) without extensive effort.
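
As a first smoke test, something like the following is what I would call "without extensive effort" (assuming `pip install transformers torch`; the checkpoint name is just one example of a Hub code model, not a recommendation):

```python
from transformers import pipeline

# Any text-generation code model from the Hub can be swapped in here
generator = pipeline("text-generation", model="bigcode/starcoder2-3b")

prompt = (
    "from pydantic import BaseModel\n\n"
    "class Item(BaseModel):\n"
    "    name: str\n"
    "    price: float\n\n"
    "def total_price(items: list[Item]) -> float:\n"
)
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```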

What is working:
Searching for models/datasets by name and then ranking the results by likes, downloads, etc.

What is not working:
Searching for these via full-text search [Search terms: text-to-code code generation python pydantic enum] and then ranking by likes, downloads, etc.
With Hugging Face's full-text search, a long list is returned and it is difficult to judge quality without such a ranking.
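
One workaround I am trying: run the search programmatically with `huggingface_hub` and sort by downloads myself. Note that `list_models(search=...)` matches against model ids rather than doing true full-text search, so this is only an approximation:

```python
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(search="code generation python",
                         sort="downloads", direction=-1, limit=20)
for m in models:
    print(f"{(m.downloads or 0):>12,}  likes={m.likes or 0:>5}  {m.id}")
# api.list_datasets(...) takes the same arguments for the dataset side
```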

Here are some of the seemingly better-established models that might be suitable for this task.

Any pointers on how to make a quickstart with well established models and datasets would be much appreciated.

:hugs:

E