AI-Driven Synthetic Custom Datasets for Finance and Citizen Science

tuc111 · July 25, 2025, 5:10pm

Custom Synthetic Datasets for Finance & Citizen Science Applications

Hi everyone!

I’m Emmitt from Grandma’s Boy Labs, and I’m excited to share a new project that might be of interest to folks building and fine-tuning LLMs for specialized domains like finance and healthcare.

What I’m Building

I’ve been developing high-quality, synthetically generated conversational datasets using GPT-based roleplay simulations. The datasets are created through structured prompt engineering: models simulate realistic expert-client conversations based on career and personality profiles, domain knowledge, and user intent.

Current Focus Areas:

Finance & Investment Advising

Portfolio management strategy discussions
Risk tolerance assessments
Client education (e.g., explaining ETFs, diversification, etc.)
Investment advising scenarios (new investors, retirees, etc.)

Citizen Science & Public Health

Maternal health Q&A simulations (based on real patient education needs)
Community-driven knowledge-building examples
Accessible and diverse synthetic dialogue for low-resource domains

These datasets are designed to be modular, scalable, and adaptable to any model with API access (not limited to GPT—bring your own LLM!).

How the Datasets Are Made

The datasets are built using a blend of:

Role-based personas with detailed career profiles
Prompt chains for guided conversation structure
Dialogue simulations for specific use cases
Annotated outputs (for fine-tuning, QA, or supervised RL)
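
To give a feel for how personas, prompt chains, and simulated dialogue could fit together, here is a minimal Python sketch. It is not the actual generation pipeline: the Persona class, build_chain, simulate, and fake_generate are hypothetical names invented for illustration, and fake_generate is only a stand-in for whatever LLM API you bring.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    """Hypothetical role-based persona: a role name plus a short profile."""
    role: str
    profile: str


def build_chain(advisor, client, topic, steps):
    """Build one prompt per conversation step, alternating client and advisor."""
    prompts = []
    for index, step in enumerate(steps):
        # Assumption: the client speaks first, then the speakers alternate.
        speaker = client if index % 2 == 0 else advisor
        prompts.append({
            "turn_index": index,
            "speaker": speaker.role,
            "prompt": (
                f"You are a {speaker.role}. {speaker.profile} "
                f"Topic: {topic}. In this turn, {step}."
            ),
        })
    return prompts


def simulate(prompts, generate):
    """Run each prompt through `generate`, passing along the dialogue so far."""
    history = []
    for item in prompts:
        context = "\n".join(f"{t['speaker']}: {t['message']}" for t in history)
        message = generate(item["prompt"], context)
        history.append({
            "speaker": item["speaker"],
            "message": message,
            "turn_index": item["turn_index"],
        })
    return history


def fake_generate(prompt, context):
    """Stand-in for a real model call; swap in your own LLM client here."""
    return f"[model reply to: {prompt[:40]}...]"


advisor = Persona("advisor", "You are a patient financial advisor.")
client = Persona("client", "You are a first-time investor with low risk tolerance.")
steps = [
    "ask what an ETF is",
    "explain ETFs in plain language",
    "ask a follow-up about diversification",
]
prompts = build_chain(advisor, client, "client education", steps)
dialogue = simulate(prompts, fake_generate)
for turn in dialogue:
    print(turn["turn_index"], turn["speaker"], turn["message"])
```

Keeping the model call behind a plain function argument is one way to stay model-agnostic: the chain and the personas do not change when the backend does.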

Each dataset is delivered as JSON, with structured fields for:

speaker
message
topic
turn_index
tags (e.g., “client education”, “risk profiling”)
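
As a rough illustration of what a record with these fields might look like, here is a small Python sketch that parses two turns and checks them against the field list. The field names come from the list above; the field types, the example messages, and the validate_turn helper are assumptions made for illustration, so the real files may be laid out differently.

```python
import json

# Field names are from the post; the expected types are assumptions.
REQUIRED_FIELDS = {
    "speaker": str,
    "message": str,
    "topic": str,
    "turn_index": int,
    "tags": list,
}


def validate_turn(turn):
    """Return a list of problems found in one dialogue turn (empty if none)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in turn:
            problems.append(f"missing field: {name}")
        elif not isinstance(turn[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


# Hypothetical example records, invented for illustration.
example_json = """
[
  {
    "speaker": "client",
    "message": "What exactly is an ETF?",
    "topic": "ETF basics",
    "turn_index": 0,
    "tags": ["client education"]
  },
  {
    "speaker": "advisor",
    "message": "An ETF is a fund that trades on an exchange like a stock.",
    "topic": "ETF basics",
    "turn_index": 1,
    "tags": ["client education", "risk profiling"]
  }
]
"""

turns = json.loads(example_json)
for turn in turns:
    print(turn["turn_index"], turn["speaker"], validate_turn(turn))
```

A check like this is cheap to run before fine-tuning and catches missing or mistyped fields early.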

Availability

You can browse and purchase the datasets directly from grandmasboylabs.com. I offer:

One-time dataset purchases
Full documentation and metadata on generation process
Licensing for commercial and open-source fine-tuning

Coming Soon: Model Demo

I’m also working on a Hugging Face Space demo of a fine-tuned investment advising model trained on one of the early datasets. Users will be able to:

Try a simulated intake form
Interact with the model in real time
Explore how the dataset translates to fine-tuned behavior

Open to Collaboration

I’d love to connect with:

Researchers working on synthetic data evaluation
Practitioners fine-tuning models for verticals like finance or health
Developers building personalized advisory tools with LLMs

If you’re curious about the data, want to collaborate, or just want to chat about synthetic datasets—reach out! I’d love to hear your thoughts and feedback.

Looking forward to connecting!

Ernst03 · August 2, 2025, 10:29pm

Well, it is so wonderful that people come and present.
I am aware that the mathematical structure and the meaning of the symbology are the concern.
