Hi everyone,
I’m working on building an AI-powered legal assistant focused on Nepalese law. My goal is to create a model that can provide legal advice by understanding and interpreting laws, acts, and judicial decisions in both Nepali and English.
Currently, I’m planning to use a combination of:
- Fine-tuned language models for legal reasoning — encoder models like Legal-BERT or mBERT for extractive Q&A, or a generative model like GPT-2 for free-form answers.
- Retrieval-Augmented Generation (RAG) to pull up-to-date legal information (Constitution, Civil/Criminal codes, etc.) without needing constant retraining.
What I’ve done so far:
- Collected legal texts: the Constitution of Nepal (2072 BS / 2015 AD), the Muluki Ain (2017 BS / 1963 AD), and other acts.
- Started preparing a question-answer dataset for fine-tuning.
- Exploring FAISS and LangChain for RAG implementation.
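To show the shape of the retrieval step I have in mind, here's a minimal sketch. I'm using a toy bag-of-words embedding and plain NumPy as a stand-in for a real multilingual embedding model plus FAISS/ChromaDB; the corpus snippets and article numbers below are placeholders, not actual legal text:

```python
import numpy as np

# Placeholder corpus of (citation, text) chunks. In the real pipeline these
# would be chunked sections of the Constitution, Civil/Criminal codes, etc.
corpus = [
    ("Constitution Art. 17", "Right to freedom: no person shall be deprived of personal liberty."),
    ("Constitution Art. 20", "Rights relating to justice: right to a fair trial."),
    ("Civil Code Sec. 95", "Provisions relating to marriage and its registration."),
]

def embed(text, vocab):
    # Toy bag-of-words vector; in practice I'd swap in a multilingual
    # sentence-embedding model and store the vectors in FAISS or ChromaDB.
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

# Build the vocabulary and the index matrix (the "index.add" step in FAISS terms).
vocab = {w: i for i, w in enumerate(sorted({w for _, t in corpus for w in t.lower().split()}))}
index = np.stack([embed(t, vocab) for _, t in corpus])

def retrieve(query, k=1):
    # Cosine similarity of the query against every stored chunk
    # (the equivalent of an IndexFlatIP search over normalized vectors).
    q = embed(query, vocab)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [corpus[i] for i in top]

print(retrieve("Is there a right to a fair trial?"))
```

The retrieved (citation, text) pairs would then be handed to the generator, which is what lets the final answer cite a specific article instead of answering from parametric memory alone.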
What I need help with:
- Model selection:
  - Would Legal-BERT be a good choice for fine-tuning legal Q&A, or should I use mBERT since my data involves both Nepali and English?
  - Is GPT-2 suitable for generating long-form legal explanations?
- RAG setup:
  - For a legal AI, would you recommend FAISS or ChromaDB for storing and retrieving legal document embeddings?
  - How can I balance retrieval accuracy with generation quality?
- Handling bilingual capabilities:
  - Should I fine-tune the model in Nepali directly, or train in English and use a translation layer for outputs?
  - Any suggestions for models like BLOOM or mBERT that support Nepali?
- Fine-tuning strategy:
  - Should I use a SQuAD-style Q&A format, or focus on situation-based legal questions?
  - Any best practices for avoiding hallucinations in legal answers?
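For context on the format question: this is the SQuAD-style record layout I'm weighing against situation-based questions. In SQuAD format the answer must be an exact character-offset span of the context, which constrains the model to quote the law rather than paraphrase it. The provision text below is a made-up placeholder:

```python
import json

# One SQuAD-style training record. The answer is an exact span of the
# context, located by character offset. Placeholder text, not a real article.
record = {
    "context": "Article 17: Every citizen shall have the freedom of opinion and expression.",
    "question": "What freedom does Article 17 guarantee?",
    "answers": {
        "text": ["freedom of opinion and expression"],
        "answer_start": [41],  # character offset of the span in the context
    },
}

# Sanity-check that the offset really points at the answer span -- extractive
# QA training scripts rely on these offsets being exact.
start = record["answers"]["answer_start"][0]
span = record["answers"]["text"][0]
assert record["context"][start:start + len(span)] == span

print(json.dumps(record, ensure_ascii=False, indent=2))
```

The trade-off as I understand it: SQuAD-style data trains reliable extraction and grounding, while situation-based pairs teach the free-form reasoning clients actually ask for — so a mix of both may be the answer.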
I want to build a model that doesn’t just generate answers but cites the correct articles or acts — ensuring transparency and trust.
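To make that citation requirement concrete, this is roughly the prompt template I'd feed the generator: retrieved chunks are numbered, and the instruction restricts the model to citing only those sources (all strings are placeholders, and the exact wording would need prompt-tuning):

```python
def build_prompt(question, retrieved):
    # retrieved: list of (citation, text) pairs from the vector store.
    # Numbering the sources lets the model cite "[1]", "[2]" instead of
    # inventing article numbers -- a common hallucination guard in RAG.
    sources = "\n".join(
        f"[{i + 1}] ({cite}) {text}" for i, (cite, text) in enumerate(retrieved)
    )
    return (
        "Answer the legal question using ONLY the sources below. "
        "Cite the source number and article for every claim. "
        "If the sources do not cover the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Is there a right to a fair trial?",
    [("Constitution Art. 20", "Rights relating to justice ...")],
)
print(prompt)
```

The explicit "say so if uncovered" escape hatch is there so the model has a sanctioned alternative to fabricating an answer when retrieval comes back empty.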
I'd really appreciate your expert insights on how to refine this system, avoid pitfalls, and structure the pipeline efficiently.
Thanks in advance — excited to hear your thoughts!