Data storage for pre-training a Language Model

Hey folks,

We’re building a Small Language Model (SLM) for the financial domain using a decoder-only architecture (~40M params, 2k context). Our data sources are pretty diverse — SEC filings (10-K, 10-Q, 20-F), IFRS/GAAP manuals, earnings call transcripts, financial textbooks, Wikipedia (finance), and news articles. These come in formats like PDF, HTML, TXT, iXBRL, ePub.

Our pipeline looks like this:

  1. Collect raw files (original formats).
  2. Pre-process (filter finance-specific content, normalize).
  3. Store processed files.
  4. Chunk into ~2048-token sequences.
  5. Store chunks so we can mix batches across sources (rough sketch below).
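
For concreteness, here is a rough sketch of what we mean by steps 4 and 5 (the tokenizer, paths, and JSONL layout are placeholders, not a settled design):

```python
# Rough sketch of steps 4 and 5: tokenize processed docs, cut them into
# ~2048-token chunks, and write each chunk out with enough metadata to
# mix batches across sources later. Tokenizer and paths are placeholders.
import json
from pathlib import Path

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for our own tokenizer
CHUNK_LEN = 2048

def chunk_document(text, doc_id, source):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    for i in range(0, len(ids), CHUNK_LEN):
        yield {
            "doc_id": doc_id,               # lets us split at the document level later
            "source": source,               # e.g. "sec_10k", "ifrs", "news"
            "chunk_index": i // CHUNK_LEN,
            "input_ids": ids[i : i + CHUNK_LEN],
        }

with open("chunks/sec.jsonl", "w") as out:
    for path in Path("processed/sec").glob("*.txt"):
        for chunk in chunk_document(path.read_text(), doc_id=path.stem, source="sec_10k"):
            out.write(json.dumps(chunk) + "\n")
```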

We’re trying to figure out the best way to store and index files/chunks:
• Directory hierarchy + manifest/index files?
• Flat storage with metadata indices?
• Use a vector DB (Pinecone/Milvus) only for chunks, keep raw/processed in blob storage?
• How do you usually handle train/test splits — doc-level or chunk-level?


> Directory hierarchy + manifest/index files?
> Flat storage with metadata indices?

I think shallow, flat, sharded layouts such as WebDataset tend to be faster to read at scale than deep directory trees full of small files.
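
Something like this, assuming you use the webdataset package (the shard pattern, maxcount, and paths are just examples):

```python
# Write chunk records into flat, sharded tar files with webdataset.
import json
import webdataset as wds

with wds.ShardWriter("shards/finance-%06d.tar", maxcount=50_000) as sink:
    with open("chunks/sec.jsonl") as f:          # chunk records from your step 5
        for line in f:
            chunk = json.loads(line)
            sink.write({
                "__key__": f"{chunk['doc_id']}_{chunk['chunk_index']:05d}",
                "json": json.dumps(chunk).encode("utf-8"),
            })

# Reading back during training: shards stream sequentially, which is what
# makes the flat layout fast at scale.
ds = wds.WebDataset("shards/finance-{000000..000009}.tar").decode()
```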

> doc-level or chunk-level?

Doc-level seems better: with chunk-level splits, chunks from the same document can land in both train and test, which leaks near-duplicate text and inflates your eval numbers.
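
One way to do that is to hash the document ID rather than shuffling chunks, so the split is deterministic and every chunk from a given filing stays on one side. Rough sketch (the path and the 2% test fraction are placeholders):

```python
# Doc-level split: decide train/test per document ID, then route all of that
# document's chunks to the same side.
import hashlib
import json

def is_test_doc(doc_id, test_fraction=0.02):
    h = int(hashlib.sha256(doc_id.encode("utf-8")).hexdigest(), 16)
    return (h % 10_000) < test_fraction * 10_000

train, test = [], []
with open("chunks/sec.jsonl") as f:      # whatever chunk records you already have
    for line in f:
        chunk = json.loads(line)
        (test if is_test_doc(chunk["doc_id"]) else train).append(chunk)
```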


I would suggest Axolotl for this; it provides a very simple wrapper and lets you store the data as a plain HF dataset of any size.
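
Something like this for the HF dataset side (paths and names are placeholders), which you can then point Axolotl at:

```python
# Store the chunk records as a Hugging Face dataset on disk (or push to the Hub).
import json
from datasets import Dataset

records = [json.loads(line) for line in open("chunks/sec.jsonl")]
ds = Dataset.from_list(records)
ds.save_to_disk("data/finance_pretrain")   # or ds.push_to_hub("your-org/finance-pretrain")
```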
