Hey folks,
We’re building a Small Language Model (SLM) for the financial domain using a decoder-only architecture (~40M params, 2k context). Our data sources are pretty diverse — SEC filings (10-K, 10-Q, 20-F), IFRS/GAAP manuals, earnings call transcripts, financial textbooks, Wikipedia (finance), and news articles. These come in formats like PDF, HTML, TXT, iXBRL, ePub.
Our pipeline looks like this:
- Collect raw files (original formats).
- Pre-process (filter finance-specific content, normalize).
- Store processed files.
- Chunk into ~2048 tokens.
- Store chunks, with enough metadata to mix batches across sources (rough sketch below).
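
For context, the chunking step is roughly the sketch below. The whitespace `tokenize` is just a stand-in for whatever subword tokenizer we end up training, and the JSONL field names (`doc_id`, `source`, `chunk_index`) are our own convention, nothing standard:

```python
import json
from pathlib import Path

CHUNK_TOKENS = 2048

def tokenize(text: str) -> list[str]:
    # Placeholder: the real pipeline would use token IDs from our trained tokenizer.
    return text.split()

def chunk_document(doc_id: str, source: str, text: str):
    """Split one processed document into ~2048-token chunk records."""
    tokens = tokenize(text)
    for start in range(0, len(tokens), CHUNK_TOKENS):
        window = tokens[start:start + CHUNK_TOKENS]
        yield {
            "doc_id": doc_id,          # lets us do doc-level splits later
            "source": source,          # e.g. "sec_10k", "earnings_calls", "wiki_finance"
            "chunk_index": start // CHUNK_TOKENS,
            "num_tokens": len(window),
            "text": " ".join(window),  # or store token IDs directly
        }

def write_chunks(processed_dir: Path, out_path: Path, source: str) -> None:
    """Write all chunks for one source to a single JSONL file."""
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted(processed_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            for record in chunk_document(path.stem, source, text):
                out.write(json.dumps(record) + "\n")

# Example: write_chunks(Path("processed/sec_10k"), Path("chunks/sec_10k.jsonl"), "sec_10k")
```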
We’re trying to figure out the best way to store and index the raw files, processed files, and chunks:
• Directory hierarchy + manifest/index files?
• Flat storage with metadata indices?
• Use a vector DB (Pinecone/Milvus) only for chunks, keep raw/processed in blob storage?
• How do you usually handle train/test splits, doc-level or chunk-level? (We sketched a doc-level version below.)
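
In case a concrete example helps frame the question, this is roughly what we had in mind for the directory-plus-manifest option and a doc-level split. The layout, manifest fields, and hash-bucket split are all just illustrative assumptions on our side:

```python
import hashlib
import json
from pathlib import Path

# Layout we're considering (purely hypothetical):
#   raw/<source>/...            original PDF/HTML/iXBRL files
#   processed/<source>/*.txt    normalized text
#   chunks/<source>.jsonl       chunk records (see chunking sketch above)
#   manifests/<source>.jsonl    one record per document, linking all three stages

def manifest_record(source: str, raw_path: Path, processed_path: Path) -> dict:
    """One manifest entry per document; the checksum helps catch re-download/re-processing drift."""
    return {
        "doc_id": processed_path.stem,
        "source": source,
        "raw_path": str(raw_path),
        "processed_path": str(processed_path),
        "sha256": hashlib.sha256(raw_path.read_bytes()).hexdigest(),
    }

def split_for(doc_id: str, test_fraction: float = 0.05) -> str:
    # Doc-level split: hash the doc_id so every chunk of a document lands on the
    # same side, avoiding near-duplicate leakage between train and test.
    bucket = int(hashlib.sha256(doc_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return "test" if bucket < int(test_fraction * 10_000) else "train"

def split_chunks(chunks_path: Path, train_out: Path, test_out: Path) -> None:
    """Route each chunk record to train or test based on its parent document."""
    with chunks_path.open() as src, train_out.open("w") as tr, test_out.open("w") as te:
        for line in src:
            record = json.loads(line)
            (te if split_for(record["doc_id"]) == "test" else tr).write(line)

# Example: split_chunks(Path("chunks/sec_10k.jsonl"),
#                       Path("chunks/sec_10k.train.jsonl"),
#                       Path("chunks/sec_10k.test.jsonl"))
```

The appeal of hashing `doc_id` is that the split is deterministic and reproducible without keeping a separate split file around, but we're not sure whether that's what people actually do at this scale. Curious how you'd approach it.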