Organizing Data

I’m training an SSM right now, specifically the Mamba model from state-spaces. I’ve managed to set up a basic training loop for the 370m model. I can get it to output half-coherent and sometimes correct information, but mostly it hallucinates. This is likely because my dataset is small and mixed: some 35 PDFs, primarily covering the insurance industry, plus a smaller amount of basic high school comp and some things like dictionaries and thesauruses in a CSV.

This is me training my model

This is me building the CSV

I’ve cobbled this much together using various resources here and ChatGPT, but I’m not sure how to structure the data in any meaningful way, or how to prepare it for causal language modeling. I’m probably doing lots of other things wrong too. I need someone to point out what I’ve missed entirely and what I can do better. At this point my assumption is that everything I’ve done is wrong, even though some of it is working.
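For context, one common way to shape text for causal (next-token) training is to join the tokenized documents into one stream, separated by an end-of-sequence token, and cut that stream into fixed-length chunks where the target is the input shifted by one position. Below is a minimal sketch of that idea, assuming the text is already tokenized into lists of integer ids. The function name, `block_size`, and `eos_id` are hypothetical and chosen only for illustration; this is not taken from my training code.

```python
from typing import Dict, List


def pack_for_causal_lm(
    docs: List[List[int]], block_size: int, eos_id: int
) -> List[Dict[str, List[int]]]:
    """Pack tokenized documents into fixed-length next-token examples.

    docs:       one list of token ids per document (assumed already tokenized)
    block_size: number of input tokens per training example
    eos_id:     id appended after each document to mark its boundary
    """
    # Build one long token stream with an EOS marker after every document.
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)

    # Each example needs block_size + 1 tokens: the extra token supplies
    # the target for the last input position.
    chunk_len = block_size + 1
    examples = []
    for start in range(0, len(stream) - chunk_len + 1, chunk_len):
        chunk = stream[start:start + chunk_len]
        examples.append({
            "inputs": chunk[:-1],   # tokens the model sees
            "targets": chunk[1:],   # same tokens shifted left by one
        })
    # Leftover tokens that do not fill a whole chunk are dropped.
    return examples


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5]]  # made-up token ids
    for ex in pack_for_causal_lm(toy_docs, block_size=3, eos_id=0):
        print(ex)
```

If the training code already shifts the labels internally, the same unshifted chunk would be passed as both inputs and labels instead of shifting here, otherwise the targets end up off by two.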

Also, is there a more GPU-efficient way for me to handle the data on Colab? I can only train up to a 370m-parameter model, which I think is fine; the end goal is summarizing insurance emails and information retrieval.