Unlock AI training data with the open-sourced Synthetic Data SDK

spintronic · February 4, 2025, 9:48am

MOSTLY AI has open-sourced its Synthetic Data SDK, which lets you create privacy-preserving, AI-generated synthetic data directly from your existing datasets, all within your own secure environment.

Key Features:

Broad Data Support: Handle mixed data types (categorical, numerical, geospatial, text), single/multi-table datasets & time-series data.

Multiple Model Types: Leverage TabularARGN (state of the art for tabular data), fine-tuned HuggingFace models, and an efficient LSTM for text generation.

Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.

Automated Quality Assurance: Built-in fidelity & privacy metrics with detailed HTML reports for visual data analysis.

Flexible Sampling: Upsample data, generate conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outputs via temperature adjustments.

Seamless Integration: Connect to external databases & cloud storage. The SDK is released under a fully permissive open-source license.
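
To make the temperature control listed under Flexible Sampling more concrete, here is a minimal, generic sketch of temperature-scaled sampling from a categorical distribution. This is not the SDK's code, and the post does not describe how the SDK implements it; the function names `apply_temperature` and `sample_category` are hypothetical and use only the Python standard library.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution as p_i ** (1 / T), then renormalize.

    Generic illustration only (hypothetical helper, not part of the SDK).
    T < 1 sharpens the distribution, T > 1 flattens it, T == 1 leaves it as-is.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(p > 0 for p in probs):
        raise ValueError("at least one probability must be positive")
    # Work in log space; zero-probability categories stay at zero.
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_category(categories, probs, temperature=1.0, rng=random):
    """Draw one category after temperature scaling (hypothetical helper)."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(categories, weights=scaled, k=1)[0]


if __name__ == "__main__":
    cats = ["A", "B", "C"]
    probs = [0.7, 0.2, 0.1]
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in apply_temperature(probs, t)])
    print(sample_category(cats, probs, temperature=0.5))
```

In this sketch, a lower temperature concentrates samples on the most likely categories, while a higher temperature spreads samples more evenly across rare categories.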

Check out the SDK on GitHub: mostly-ai/mostlyai (Synthetic Data SDK)
