MOSTLY AI has open-sourced its Synthetic Data SDK, which lets you create privacy-preserving, AI-generated synthetic data directly from your existing datasets, all within your own secure environment.
Key Features:
Broad Data Support: Handle mixed data types (categorical, numerical, geospatial, text), single/multi-table datasets & time-series data.
Multiple Model Types: Leverage TabularARGN (state of the art for tabular data), fine-tuned Hugging Face models, and an efficient LSTM for text generation.
Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.
Automated Quality Assurance: Built-in fidelity & privacy metrics with detailed HTML reports for visual data analysis.
Flexible Sampling: Upsample data, generate conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outputs via temperature adjustments.
Seamless Integration: Connect to external databases & cloud storage. The SDK is released under a fully permissive open-source license.
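To make the "temperature adjustments" item under Flexible Sampling more concrete, here is a minimal, generic sketch of what temperature does to a categorical distribution. This is not the SDK's code or API: the function names (`apply_temperature`, `sample_counts`) and the example probabilities are hypothetical and chosen only for illustration. It relies on the standard definition, where each probability is raised to the power 1/T and the result is renormalized.

```python
import math
import random
from collections import Counter


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution as p_i ** (1 / T), renormalized.

    Hypothetical helper for illustration; not part of the MOSTLY AI SDK.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not any(p > 0 for p in probs):
        raise ValueError("at least one probability must be positive")
    # Work in log space; zero-probability categories stay at zero.
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_counts(categories, probs, temperature, n, seed=0):
    """Draw n samples at the given temperature and count each category."""
    rng = random.Random(seed)
    scaled = apply_temperature(probs, temperature)
    return Counter(rng.choices(categories, weights=scaled, k=n))


if __name__ == "__main__":
    categories = ["A", "B", "C"]
    probs = [0.7, 0.2, 0.1]  # made-up learned distribution
    for t in (0.5, 1.0, 2.0):
        rounded = [round(p, 3) for p in apply_temperature(probs, t)]
        print(t, rounded, dict(sample_counts(categories, probs, t, n=1000)))
```

A temperature below 1 concentrates samples on the most likely category, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it so that rare categories appear more often.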
Check out the SDK on GitHub: mostly-ai/mostlyai (Synthetic Data SDK)