Which Datasets for RAG and fine tuning LLM?

Shivaa999 · February 28, 2025, 11:18am

I want to build a RAG + BERT model that advises on Indian legislation, but I don’t know which datasets to use for the RAG component and which for the LLM. Which datasets would work well for each?

John6666 · February 28, 2025, 3:32pm

By HuggingChat:

To create a Retrieval-Augmented Generation (RAG) + BERT model for advising on Indian legislation, follow these steps:

1. Data Collection

RAG Dataset:
- Use the LEDGAR dataset from Hugging Face as a starting point. It is not specific to India (it consists of labeled contract provisions drawn from US SEC filings), so supplement it with the Indian Constitution, acts, and case law. Source region-specific legal documents from Indian government websites where possible.
LLM Dataset:
- Combine datasets like English Gigaword and BookCorpus for a broad general-language corpus. Add India-specific content such as legal blogs, news articles, and government publications. Include multilingual data if regional languages are needed.

2. Data Preprocessing

RAG: Split legal documents into overlapping chunks, so that a provision cut at one chunk boundary still appears whole in a neighboring chunk. Clean and normalize the formatting before indexing to improve retrieval accuracy.
LLM: Tokenize text, remove duplicates, and split data into training and validation sets. Consider synthetic data generation to supplement underrepresented areas.
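The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the function names, the word-based chunking, and the default sizes (200 words with 50 words of overlap, 10% validation) are all assumptions chosen for readability. A real setup would more likely chunk by tokenizer tokens or by section boundaries of the act.

```python
import hashlib
import random


def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    chunk_size and overlap are illustrative defaults, not tuned values.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def deduplicate(texts):
    """Drop exact duplicates (ignoring case and whitespace), keeping the first occurrence."""
    seen = set()
    unique = []
    for t in texts:
        normalized = " ".join(t.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def train_val_split(items, val_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and return (train, validation) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction)) if items else 0
    return items[n_val:], items[:n_val]
```

The overlap means a sentence that straddles a boundary is fully contained in at least one chunk as long as it is shorter than the overlap. Note that this deduplication only catches exact matches after normalization; near-duplicates (for example, amended versions of the same section) would need a fuzzier method.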

3. Model Selection and Fine-tuning

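Use Legal-BERT, which is pretrained on English legal text, as the encoder. Consider ModernBERT if your documents exceed BERT's 512-token limit, since it supports sequences of up to 8,192 tokens. Fine-tune whichever model you choose on your Indian legal corpus.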

4. Retrieval System Setup

Vector Database: Use FAISS or ChromaDB for efficient document retrieval. Encode the document chunks into embeddings using models like Legal-BERT or Doc Dale for better accuracy.
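To make the retrieval step concrete, here is a toy sketch of what a vector index does: embed each chunk, embed the query, and return the nearest chunks by cosine similarity. Everything here is a stand-in. The `embed` function is a bag-of-words counter rather than Legal-BERT, and `TinyIndex` is a hypothetical brute-force class rather than the FAISS or ChromaDB API. It only illustrates the add/search flow that those libraries provide at scale.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words term counts (stand-in for a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyIndex:
    """Hypothetical brute-force index; a real system would use FAISS or ChromaDB."""

    def __init__(self):
        self.chunks = []
        self.vectors = []

    def add(self, chunk):
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def search(self, query, k=3):
        """Return up to k (chunk, score) pairs, highest similarity first."""
        q = embed(query)
        scored = [(cosine(q, v), i) for i, v in enumerate(self.vectors)]
        # Sort by descending score; ties keep insertion order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.chunks[i], score) for score, i in scored[:k]]
```

Swapping in a real encoder would mean replacing `embed` with a function that returns dense vectors and replacing the linear scan with the vector database's own search call.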

5. Integration and Validation

Integrate the retrieval and generation systems. Test the model with diverse queries to ensure relevance and accuracy. Validate the model’s performance, especially in handling complex legal queries.
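As a rough sketch of the integration and validation step, the snippet below shows one way to pass retrieved chunks to a generator as a prompt, and one simple retrieval metric (hit rate at k). The prompt wording, the function names, and the shape of `eval_set` (pairs of query and expected chunk id) are assumptions for illustration. The thread does not specify a generator model or an evaluation protocol.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a prompt asking the generator to answer only from the retrieved context."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the numbered context passages. "
        "Cite passage numbers, and say so if the context is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def hit_rate_at_k(eval_set, retrieve, k=3):
    """Fraction of queries whose expected chunk id appears in the top-k retrieved ids.

    eval_set: list of (query, expected_chunk_id) pairs (hypothetical format).
    retrieve: function mapping a query to a ranked list of chunk ids.
    """
    if not eval_set:
        return 0.0
    hits = sum(
        1 for query, expected_id in eval_set if expected_id in retrieve(query)[:k]
    )
    return hits / len(eval_set)
```

Measuring retrieval separately from generation helps locate failures: if the hit rate is low, the wrong passages are reaching the generator and no amount of fine-tuning on the generation side will fix the answers.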

6. Considerations

Language Support: Use multilingual BERT (mBERT) if handling regional languages.
Data Availability: Address potential scarcity by augmenting with synthetic data and sourcing from reputable Indian legal sites.

7. Implementation Steps

Work through the steps in order: data sourcing, preprocessing, model setup, then integration, testing and validating at each stage.

By following these steps, you can develop an effective RAG + BERT model tailored for advising on Indian legislation, leveraging available datasets and models while addressing specific regional needs.
