Hi all, I’m trying to figure out how LLMs handle understanding and generating source code and/or structured data, but I’m struggling to find any starting points. I have questions like: do people treat code and data as separate modalities from natural language, or are they handled as one modality with some kind of translation step? I have a fair idea of how multi-modal models work in general, but what I’m looking at doesn’t seem as clear-cut as the difference between natural language and images, for example, because natural language and source code both “are” text, in some sense.
The code/data I’m specifically interested in is HTML as rendered in the browser. I want a model to understand the structure of what the browser is displaying, so that it can build an understanding of the application generating that structure. I have a tokenizer that can extract/encode meaning from what the browser is displaying, but I don’t know how to continue from there… I understand replacing/extending output layers to fine-tune a model, but as far as I can tell a different tokenizer would mean replacing/extending the input layers, and I don’t see how you’d do that. It would be like having a preprocessor converting my tokenization into the model’s tokenization, which seems like the wrong approach.
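To make that concrete, the closest thing to a plan I’ve got is something like the sketch below: keep a standard pretrained model, add my structural tokens to its existing vocabulary, and resize the input embedding matrix so the new ids get rows that are learned during fine-tuning. This is just what I imagine, not something I know is right; it assumes a Hugging Face transformers stack with a GPT-2 base model, and the specific structural tokens are hypothetical placeholders.

```python
# Minimal sketch of what I imagine, assuming Hugging Face transformers + GPT-2;
# the structural tokens below are hypothetical examples, not my real tokenizer.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register tokens for structural markup that the base vocabulary would
# otherwise split into many small pieces.
structural_tokens = ["<div>", "</div>", "<button>", "</button>"]
tokenizer.add_tokens(structural_tokens)

# Grow the input embedding matrix so the new token ids have embedding rows;
# the new rows start randomly initialised and would be learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))

# After that, fine-tuning would proceed as usual on sequences mixing text and markup.
inputs = tokenizer("<div><button>Submit</button></div>", return_tensors="pt")
outputs = model(**inputs)
```

Is that roughly the standard way to handle a new/extended tokenization, or is there a better-established approach for structured input like a DOM?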
Does anybody know of any research or other good starting points I can look at to unstick myself?