Data storage for pre-training a Language Model

Hey folks,

We’re building a Small Language Model (SLM) for the financial domain using a decoder-only architecture (~40M params, 2k context). Our data sources are pretty diverse — SEC filings (10-K, 10-Q, 20-F), IFRS/GAAP manuals, earnings call transcripts, financial textbooks, Wikipedia (finance), and news articles. These come in formats like PDF, HTML, TXT, iXBRL, ePub.

Our pipeline looks like this:

  1. Collect raw files (original formats).
  2. Pre-process (filter finance-specific content, normalize).
  3. Store processed files.
  4. Chunk into ~2048-token sequences.
  5. Store chunks so we can mix batches across sources (rough sketch below).
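
For concreteness, here is a rough sketch of what we mean by steps 4 and 5 (the tokenizer, paths, and JSONL layout are placeholders, not a settled design):

```python
# Rough sketch of steps 4 and 5: tokenize processed docs, cut them into
# ~2048-token chunks, and write each chunk out with enough metadata to
# mix batches across sources later. Tokenizer and paths are placeholders.
import json
from pathlib import Path

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for our own tokenizer
CHUNK_LEN = 2048

def chunk_document(text, doc_id, source):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    for i in range(0, len(ids), CHUNK_LEN):
        yield {
            "doc_id": doc_id,               # lets us split at the document level later
            "source": source,               # e.g. "sec_10k", "ifrs", "news"
            "chunk_index": i // CHUNK_LEN,
            "input_ids": ids[i : i + CHUNK_LEN],
        }

with open("chunks/sec.jsonl", "w") as out:
    for path in Path("processed/sec").glob("*.txt"):
        for chunk in chunk_document(path.read_text(), doc_id=path.stem, source="sec_10k"):
            out.write(json.dumps(chunk) + "\n")
```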

We’re trying to figure out the best way to store and index files/chunks:
• Directory hierarchy + manifest/index files?
• Flat storage with metadata indices?
• Use a vector DB (Pinecone/Milvus) only for chunks, keep raw/processed in blob storage?
• How do you usually handle train/test splits — doc-level or chunk-level?


> Directory hierarchy + manifest/index files?
> Flat storage with metadata indices?

I think shallow, flat, sharded layouts such as WebDataset tend to be faster to read at scale than deep directory trees full of small files.
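
Something like this, assuming you use the webdataset package (the shard pattern, maxcount, and paths are just examples):

```python
# Write chunk records into flat, sharded tar files with webdataset.
import json
import webdataset as wds

with wds.ShardWriter("shards/finance-%06d.tar", maxcount=50_000) as sink:
    with open("chunks/sec.jsonl") as f:          # chunk records from your step 5
        for line in f:
            chunk = json.loads(line)
            sink.write({
                "__key__": f"{chunk['doc_id']}_{chunk['chunk_index']:05d}",
                "json": json.dumps(chunk).encode("utf-8"),
            })

# Reading back during training: shards stream sequentially, which is what
# makes the flat layout fast at scale.
ds = wds.WebDataset("shards/finance-{000000..000009}.tar").decode()
```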

> doc-level or chunk-level?

Doc-level seems better: with chunk-level splits, chunks from the same document can land in both train and test, which leaks near-duplicate text and inflates your eval numbers.
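
One way to do that is to hash the document ID rather than shuffling chunks, so the split is deterministic and every chunk from a given filing stays on one side. Rough sketch (the path and the 2% test fraction are placeholders):

```python
# Doc-level split: decide train/test per document ID, then route all of that
# document's chunks to the same side.
import hashlib
import json

def is_test_doc(doc_id, test_fraction=0.02):
    h = int(hashlib.sha256(doc_id.encode("utf-8")).hexdigest(), 16)
    return (h % 10_000) < test_fraction * 10_000

train, test = [], []
with open("chunks/sec.jsonl") as f:      # whatever chunk records you already have
    for line in f:
        chunk = json.loads(line)
        (test if is_test_doc(chunk["doc_id"]) else train).append(chunk)
```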


I would suggest Axolotl for this; it provides a very simple wrapper and lets you store the data as a plain HF dataset of any size.
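
Something like this for the HF dataset side (paths and names are placeholders), which you can then point Axolotl at:

```python
# Store the chunk records as a Hugging Face dataset on disk (or push to the Hub).
import json
from datasets import Dataset

records = [json.loads(line) for line in open("chunks/sec.jsonl")]
ds = Dataset.from_list(records)
ds.save_to_disk("data/finance_pretrain")   # or ds.push_to_hub("your-org/finance-pretrain")
```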
