Hi everyone,
I am quite new to working with LLMs. I have a business background with limited tech know-how. Now I need some support in how to prepare and
My plan is the following: I want to finetune a pretrained model on scientific knowledge, analysis and methods to perform and analyse
The idea is to train the model on business-specific knowledge, starting quite broad with general knowledge and then going very deep into exact business analysis, evaluations, etc.
Example:
- 100 books: general economic know-how (macro, micro) as background knowledge
- 50 books: more specific business knowledge: accounting principles, controlling
- 20 books: valuation methods using Python, fraud detection in finance (which data sources are needed, which steps to analyze)
Analyse publicly available analyst reports to basically reverse-engineer them:
which analyses were done on what data, and which patterns led to which conclusions. I have around 1500 analyst reports as PDFs stored in a MySQL database already.
Then I want to identify the following information during the reverse engineering: company, data being used, analysis performed, patterns identified, and the resulting decision to invest or not to invest.
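To make the reverse-engineering output more concrete, here is a rough sketch of how one extracted record could be structured. The class name, the field names and the example values are all my own assumptions (nothing here comes from a real report); it only shows the shape of the data, not how the extraction itself would work.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ReportRecord:
    """One reverse-engineered analyst report (hypothetical field names)."""
    company: str
    data_used: List[str] = field(default_factory=list)
    analyses_performed: List[str] = field(default_factory=list)
    patterns_identified: List[str] = field(default_factory=list)
    decision: str = "unknown"  # assumed labels: "invest", "do_not_invest", "unknown"


def to_json_line(record: ReportRecord) -> str:
    """Serialize one record as a single JSON line (one report per line)."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Made-up placeholder values, only to show what a filled record looks like.
example = ReportRecord(
    company="Example Corp",
    data_used=["annual report", "quarterly cash flow statements"],
    analyses_performed=["discounted cash flow valuation"],
    patterns_identified=["declining free cash flow margin"],
    decision="do_not_invest",
)
line = to_json_line(example)
print(line)
```

One JSON line per report would keep the 1500 reports easy to load later, either back into the database or into a training script.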
In the end the idea is to use the finetuned model to screen data, find patterns and identify e.g. investment potential.
I have some 400+ pieces of scientific literature on different topics, and I want the model to be finetuned on that knowledge in the first step:
- Process and prepare the literature (PDF, EPUB, TXT, etc., incl. pictures)
- Bring all the literature into one format (What's the best format? How should I store it: inside a database or as raw files?)
- Divide the content into topics and pieces, e.g. along the chapters and table of contents? Would it be helpful to store that content and its structure inside a database?
- Merge all books into a general structure and bring similar topics together (e.g. all chapters about company valuation: general knowledge, analysis & methods)
- Can I use an LLM to first analyze the books and their content, and whether they contain general knowledge or practical methods and analyses to perform?
- Goal: Have a large database of relevant topics and content
- What should the optimal training process look like? What kind of training dataset should I create? Just questions? Or some examples? That's the part where I am pretty lost. Or is RAG the best way?
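As a rough sketch of the "divide along chapters and store in a database" steps above: the code below assumes the books are already converted to plain text and that chapter headings look like "Chapter 1: Title" on their own line, which real books often will not. The table layout, function names and sample text are hypothetical, and SQLite only stands in for whatever database is actually used.

```python
import json
import re
import sqlite3

# Assumption: headings have the form "Chapter <number>: <title>" on one line.
CHAPTER_RE = re.compile(r"^Chapter[ \t]+(\d+)[:.]?[ \t]*(.*)$", re.MULTILINE)


def split_chapters(text):
    """Return a list of (chapter_number, title, body) tuples."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        # The body runs from the end of this heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        chapters.append((int(m.group(1)), m.group(2).strip(), body))
    return chapters


def store_chapters(conn, book_title, chapters):
    """Store chapters in a simple (hypothetical) table, one row per chapter."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapters "
        "(book TEXT, number INTEGER, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO chapters VALUES (?, ?, ?, ?)",
        [(book_title, n, t, b) for n, t, b in chapters],
    )
    conn.commit()


def export_jsonl(conn):
    """Export all chapters as JSON lines, e.g. as raw text chunks for later use."""
    rows = conn.execute(
        "SELECT book, number, title, body FROM chapters ORDER BY book, number"
    )
    return [
        json.dumps({"book": b, "chapter": n, "title": t, "text": body})
        for b, n, t, body in rows
    ]


# Tiny made-up sample standing in for a converted book.
sample = """Chapter 1: Company Valuation
Discounted cash flow basics.

Chapter 2: Fraud Detection
Typical warning signs in financial statements.
"""

conn = sqlite3.connect(":memory:")
chapters = split_chapters(sample)
store_chapters(conn, "Sample Book", chapters)
lines = export_jsonl(conn)
print("\n".join(lines))
```

Keeping the chapter structure in a table like this would leave both options open: the same rows could be exported as text for finetuning or indexed as chunks for a RAG setup.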
Are my steps fine? What are your thoughts on how I should do it, in what form?
Thanks a lot! I appreciate any help and support!
Best, Ben