Building My Own Knowledge Base LLM

Hello

I want to build my own knowledge base large language model (LLM) using over 40 GB of data, including books and research papers. I’d appreciate your suggestions and insights on how to approach this.

Specifically, I’m seeking guidance on:

  1. Approaches for constructing the LLM: What methodologies or frameworks would you recommend for building a robust LLM using my dataset?
  2. Data preprocessing techniques: How should I preprocess the data for good model quality and efficient training? Are there specific tools or libraries you would suggest for this task?
  3. Fine-tuning or RAG models: Would fine-tuning existing models or implementing RAG (Retrieval-Augmented Generation) models be beneficial for this project? If so, what are some best practices or resources to consider?
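
To make items 2 and 3 concrete, here is a minimal sketch of what I understand the preprocessing and retrieval steps of RAG to be: split documents into overlapping chunks, rank the chunks against a question, and paste the best ones into a prompt for an existing model. Everything in it is an assumption for illustration only. The function names (`chunk_text`, `retrieve`, `build_prompt`), the chunk sizes, and the prompt wording are hypothetical, and the bag-of-words cosine score is a stand-in for the learned embeddings and vector index a real system would likely use.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vec, bag_of_words(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=2):
    """Assemble a prompt that grounds the question in retrieved chunks."""
    context = "\n\n".join(retrieve(query, chunks, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

In this sketch the output of `build_prompt` would be sent to a pretrained model unchanged, so no training on the 40 GB is involved. Fine-tuning would be a separate step that updates the model weights instead.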

Hi! I’m currently searching for the same solution.