Synthetic Data Generator using CTGAN
A production-ready Synthetic Data Generator designed to create realistic, configurable, and scalable synthetic datasets for machine learning, analytics, and testing use cases.
Generate realistic synthetic tabular data using CTGAN (Conditional Tabular Generative Adversarial Network) via the SDV (Synthetic Data Vault) library in Python.
This project demonstrates how to build an end-to-end synthetic data pipeline in Python that can be extended toward microservices, AI workflows, and data platforms. The pipeline lets you:
- Load real tabular data
- Train a CTGAN model
- Generate high-quality synthetic data that preserves statistical properties and relationships
- Compare real vs. synthetic distributions (basic checks)
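The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of synthesizer.py: the function names (`train_and_sample`, `compare_basic`, `run_pipeline`), the `epochs` default, and the output path are assumptions made for the example. It uses `SingleTableMetadata` and `CTGANSynthesizer` from SDV 1.x; newer SDV releases also offer a unified `Metadata` class, so adjust to the version you have installed.

```python
import pandas as pd


def train_and_sample(real: pd.DataFrame, num_rows: int = 1000, epochs: int = 300) -> pd.DataFrame:
    """Fit a CTGAN synthesizer on `real` and return `num_rows` synthetic rows."""
    # Imported lazily so compare_basic() can be used without SDV installed.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import CTGANSynthesizer

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real)  # infer column types from the data

    synthesizer = CTGANSynthesizer(metadata, epochs=epochs)
    synthesizer.fit(real)
    return synthesizer.sample(num_rows=num_rows)


def compare_basic(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side mean and standard deviation for shared numerical columns."""
    numeric = real.select_dtypes(include="number").columns.intersection(synthetic.columns)
    return pd.DataFrame({
        "real_mean": real[numeric].mean(),
        "synthetic_mean": synthetic[numeric].mean(),
        "real_std": real[numeric].std(),
        "synthetic_std": synthetic[numeric].std(),
    })


def run_pipeline(csv_path: str, out_path: str, num_rows: int = 1000) -> pd.DataFrame:
    """Load real data, train, generate, save, and return the comparison table."""
    real = pd.read_csv(csv_path)
    synthetic = train_and_sample(real, num_rows=num_rows)
    synthetic.to_csv(out_path, index=False)
    return compare_basic(real, synthetic)
```

A call such as `run_pipeline("sample_data.csv", "my_synthetic.csv")` (the output name is hypothetical) would return a small table of means and standard deviations. Closely matching values are only a coarse sanity check and do not establish that relationships between columns are preserved.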
Great for privacy-preserving data sharing, machine learning prototyping, testing pipelines, or augmenting small datasets.
Features
- Uses the modern SDV (≥1.0) API with clean metadata handling
- Supports mixed data types: numerical + categorical
- Two ready-to-use scripts:
  - synthesizer.py → full training + generation (more customizable)
  - synthesizer_quick.py → fast generation with default settings (sample data not required)
- Example real dataset (sample_data.csv) included
- Pre-generated synthetic examples (synthesized_data_1000.csv, quick_synthesized_data_1000.csv)
- Minimal dependencies (just sdv + pandas)
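To illustrate the mixed-type and metadata points in the list above, here is a hedged sketch of how column types might be inspected and, where auto-detection guesses wrong, overridden. The helper names `split_column_types` and `build_metadata` are hypothetical and not part of this project; `detect_from_dataframe` and `update_column` are methods of SDV 1.x's `SingleTableMetadata`.

```python
import pandas as pd


def split_column_types(df: pd.DataFrame) -> dict:
    """Rough pass: numeric dtypes are numerical, everything else is categorical."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = [col for col in df.columns if col not in numerical]
    return {"numerical": numerical, "categorical": categorical}


def build_metadata(df: pd.DataFrame, categorical_overrides=()):
    """Auto-detect metadata, then force the listed columns to be categorical."""
    # Imported lazily so split_column_types() works without SDV installed.
    from sdv.metadata import SingleTableMetadata

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(df)
    for column in categorical_overrides:
        # Useful for integer-coded categories that would be detected as numerical.
        metadata.update_column(column_name=column, sdtype="categorical")
    return metadata
```

Integer-coded categories (for example a numeric region code) are a common case where an override helps, since dtype-based detection would otherwise treat them as continuous values.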