Synthetic Data Generator using CTGAN

January 28, 2026

A production-ready Synthetic Data Generator designed to create realistic, configurable, and scalable synthetic datasets for machine learning, analytics, and testing use cases.
Generate realistic synthetic tabular data using CTGAN (Conditional Tabular Generative Adversarial Network) via the SDV (Synthetic Data Vault) library in Python.
This project demonstrates how to build an end-to-end synthetic data pipeline using Python, with extensibility toward microservices, AI workflows, and data platforms.

Load real tabular data
Train a CTGAN model
Generate high-quality synthetic data that preserves statistical properties and relationships
Compare real vs. synthetic distributions (basic checks)
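
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of synthesizer.py: the function names train_and_sample, compare_basic, and main are hypothetical, the epochs value is an arbitrary choice, and the SDV calls assume the 1.x single-table API (SingleTableMetadata, CTGANSynthesizer), which newer SDV releases may flag as deprecated in favor of a unified Metadata class. The comparison is deliberately basic: means for numerical columns and the share of the most frequent value for categorical ones.

```python
import pandas as pd


def train_and_sample(real: pd.DataFrame, num_rows: int = 1000, epochs: int = 300) -> pd.DataFrame:
    """Fit CTGAN on `real` and return `num_rows` synthetic rows (hypothetical helper)."""
    # Imported lazily so the comparison helper below can be used without SDV installed.
    # Assumes the SDV 1.x single-table API.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import CTGANSynthesizer

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real)  # infer column types from the data

    synthesizer = CTGANSynthesizer(metadata, epochs=epochs)
    synthesizer.fit(real)
    return synthesizer.sample(num_rows=num_rows)


def compare_basic(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Basic per-column check: mean for numerical columns, top-category share otherwise."""
    rows = []
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            rows.append({
                "column": col,
                "stat": "mean",
                "real": real[col].mean(),
                "synthetic": synthetic[col].mean(),
            })
        else:
            # Most frequent value in the real data; assumes the column is not all-missing.
            top = real[col].mode().iloc[0]
            rows.append({
                "column": col,
                "stat": f"share of {top!r}",
                "real": (real[col] == top).mean(),
                "synthetic": (synthetic[col] == top).mean(),
            })
    return pd.DataFrame(rows)


def main() -> None:
    """End-to-end run: load, train, generate, compare (file names taken from this project)."""
    real = pd.read_csv("sample_data.csv")
    synthetic = train_and_sample(real, num_rows=1000)
    synthetic.to_csv("synthesized_data_1000.csv", index=False)
    print(compare_basic(real, synthetic))
```

Calling main() from a directory that contains sample_data.csv would train the model, write 1,000 synthetic rows, and print the comparison table. Large gaps between the real and synthetic columns of that table suggest the model needs more epochs or more data; small gaps are a necessary sanity check, not proof of quality.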

Great for privacy-preserving data sharing, machine learning prototyping, testing pipelines, or augmenting small datasets.

Features

Uses modern SDV ≥1.0 API (clean metadata handling)
Supports mixed data types: numerical + categorical
Two ready-to-use scripts:
- synthesizer.py → full training + generation (more customizable)
- synthesizer_quick.py → fast generation with default settings (sample data not required)
Example real dataset (sample_data.csv) included
Pre-generated synthetic examples (synthesized_data_1000.csv, quick_synthesized_data_1000.csv)
Minimal dependencies (just sdv + pandas)
