Newbie needs help: Training models on science literature

benmyr · November 24, 2024, 11:51am

Hi everyone,
I am quite new to working with LLMs. I have a business background with limited tech know-how. Now I need some support in how to prepare and

My plan is the following: I want to finetune a pretrained model on scientific knowledge, analysis and methods to perform and analyse

The idea is to train the model on business-specific knowledge, starting quite broad with general knowledge and going very deep into exact business analysis, evaluations, etc.
Example:

100 books: general economic know-how (macro, micro) as background knowledge
50 books: more specific business knowledge: accounting principles, controlling,
20 books: valuation methods using Python, fraud detection in finance (which data sources are needed, which steps to analyze)

Analyse publicly available analyst reports to basically reverse-engineer them:
which analyses have been done on what data, and which patterns have led to which conclusions. I have around 1500 analyst reports as PDFs stored in a MySQL database already.

Then I want to identify the following information while doing the reverse engineering: company, data being used, analysis performed, patterns identified, and the resulting decision to invest or not to invest.
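
To make the fields above concrete, one extracted record per analyst report could look roughly like the sketch below. The class name, field names, and the allowed decision values are hypothetical placeholders, not an established schema, and the example values are made up.

```python
import json
from dataclasses import dataclass, asdict, field

# Hypothetical set of allowed values for the final decision field.
ALLOWED_DECISIONS = {"invest", "do_not_invest", "unknown"}


@dataclass
class ReportExtraction:
    """One record per analyst report (all field names are assumptions)."""
    company: str
    data_used: list = field(default_factory=list)
    analyses_performed: list = field(default_factory=list)
    patterns_identified: list = field(default_factory=list)
    decision: str = "unknown"

    def __post_init__(self):
        # Reject values outside the small, fixed vocabulary.
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unexpected decision: {self.decision!r}")


def to_json(record: ReportExtraction) -> str:
    """Serialize a record so it can be stored next to the PDF in the database."""
    return json.dumps(asdict(record), ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = ReportExtraction(
        company="ExampleCo",
        data_used=["annual report"],
        analyses_performed=["discounted cash flow"],
        patterns_identified=["declining margins"],
        decision="do_not_invest",
    )
    print(to_json(example))
```

A fixed structure like this would make it possible to check whatever an LLM extracts from each report before it is stored or used as training data.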

In the end the idea is to use the finetuned model to screen data, find patterns and identify e.g. investment potential.

I have some 400+ pieces of scientific literature on different topics, and I want the model to be finetuned on that knowledge in the first step:

Process and prepare the literature (PDF, EPUB, TXT, etc., incl. pictures)
Bring all the literature into one format (What's the best format? How should I store it: inside a database or as raw files?)
Divide the content into topics and pieces, e.g. along the chapters and table of contents? Would it be helpful to store that content and the structure inside a database?
Merge all books into a general structure and bring similar topics together (e.g. all chapters about company valuation: general knowledge, analysis & methods)
Can I use an LLM to first analyze each book's content and whether it contains general knowledge or practical methods and analyses to perform?
Goal: Have a large database of relevant topics and content
What should the optimal training process look like? What kind of training dataset should I create? Just questions? Or some examples? That's the part where I am pretty lost. Or is RAG the best way?
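
As a rough illustration of the "one format" and "divide along the chapters" steps above, here is a minimal sketch that assumes each book has already been converted to plain text and that chapters start with a line like "Chapter 1 ...". The heading pattern, the `Chunk` fields, and the function names are assumptions for illustration. JSON Lines (one JSON object per line) is just one common choice of storage format, not the only option.

```python
import json
import re
from dataclasses import dataclass, asdict

# Assumed heading style: a line beginning with "Chapter <number>".
# Real books will need their own pattern or the table of contents.
CHAPTER_RE = re.compile(r"^Chapter[ \t]+\d+.*$", re.MULTILINE)


@dataclass
class Chunk:
    book: str
    chapter: str
    text: str


def split_into_chapters(book_title: str, raw_text: str) -> list:
    """Cut plain text into one chunk per chapter heading."""
    matches = list(CHAPTER_RE.finditer(raw_text))
    chunks = []
    for i, m in enumerate(matches):
        # The chapter body runs from the end of this heading
        # to the start of the next heading (or the end of the text).
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw_text)
        body = raw_text[start:end].strip()
        if body:
            chunks.append(Chunk(book_title, m.group(0).strip(), body))
    return chunks


def to_jsonl(chunks) -> str:
    """Serialize chunks as JSON Lines, one chunk per line."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in chunks)


if __name__ == "__main__":
    sample = "Preface\nChapter 1 Valuation\nDCF basics.\nChapter 2 Accounting\nBalance sheets.\n"
    print(to_jsonl(split_into_chapters("Sample Book", sample)))
```

Chunks stored this way keep the book and chapter they came from, so the same records could later be tagged by topic, loaded into a database, or indexed for RAG.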

Are my steps fine? What are your thoughts on how I should do it, in what form?

Thanks a lot! I appreciate any help and support!

Best, Ben
