Synthetic Data Generation for Machine Learning: When You Don't Have Enough Real Data
Written by
Jay Kim

Real-world data is expensive, scarce, and legally restricted. This practical guide covers how to generate synthetic tabular, image, and text data using SMOTE, CTGAN, diffusion models, and LLMs — with full code, evaluation frameworks, and privacy-preserving techniques for when real data isn't enough.
Introduction
Every machine learning project begins the same way: you need data. And in almost every case, you don't have enough of it. The model needs more labeled examples. The edge cases are underrepresented. The class distribution is hopelessly imbalanced. The data you do have is locked behind privacy regulations, buried in legacy systems, or owned by someone who won't share it. You've exhausted your annotation budget, your labelers are fatigued, and your timeline is shrinking.
This is not an unusual situation. It is the default situation.
The conventional response has been to squeeze more signal from less data: transfer learning, data augmentation, few-shot learning, active learning. These approaches work, and they should be in every practitioner's toolkit. But there's another strategy that has matured dramatically in the past few years: instead of finding more real data, you generate synthetic data.
Synthetic data is artificially generated data that mimics the statistical properties, structure, or visual appearance of real data. It can be tabular rows created by a generative model, photorealistic images rendered by a 3D engine, text produced by a large language model, or sensor readings simulated by a physics engine. The defining characteristic is that no real-world observation was directly recorded: the data was manufactured.
This idea isn't new. Flight simulators have trained pilots with synthetic experiences for decades. Physics simulations have generated training data for robotics since the early days of reinforcement learning. But what's changed is the quality, scalability, and accessibility of the tools. Diffusion models can generate photorealistic images indistinguishable from photographs. Large language models can produce plausible text in any style or domain. Tabular generative models can capture complex multivariate distributions. And the research community has developed rigorous frameworks for evaluating whether synthetic data actually helps downstream models, or quietly poisons them.
This article is a practical guide to that landscape. We'll cover when synthetic data makes sense (and when it doesn't), the major generation methods for different data modalities, evaluation strategies, and hands-on code you can adapt to your own projects.
When Does Synthetic Data Make Sense?
Before diving into methods, it's worth being precise about the problems synthetic data actually solves. Not every data shortage benefits from synthesis, and reaching for synthetic data when simpler solutions exist wastes time and introduces unnecessary complexity.
Data scarcity. This is the most straightforward case. You're building a classifier for a rare disease visible in medical images, but you only have 50 positive examples. You're training a fraud detection model, but fraudulent transactions represent 0.1% of your dataset. You're developing a self-driving perception system, but you've never captured a pedestrian crossing in heavy fog at night. In each case, the real world simply hasn't produced enough examples of the phenomenon you need to learn, and collecting more is prohibitively expensive or slow.
Privacy and compliance. Healthcare, finance, and telecommunications data is governed by regulations like HIPAA, GDPR, and CCPA that restrict how personal data can be used, shared, and stored. A hospital may have millions of patient records that would be invaluable for training diagnostic models, but sharing those records, even internally across departments, may violate privacy law. Synthetic data that preserves the statistical structure of the original data without containing any real patient information offers a path forward. Several companies (Mostly AI, Gretel, Tonic) have built entire businesses around this use case.
Annotation cost. Labeling data is expensive. A single bounding box annotation for object detection costs roughly $0.05-0.10 per box at scale; semantic segmentation masks can cost $1-5 per image; expert medical annotations can cost $50-500 per image. Synthetic data generated from 3D scenes or programmatic rules comes with perfect labels for free. When you render a synthetic street scene, you know exactly where every car, pedestrian, and traffic sign is; pixel-perfect segmentation masks, depth maps, and 3D bounding boxes are all byproducts of the rendering process.
Domain shift and edge cases. Your model was trained on daytime driving data and performs poorly at dusk. Your NLP model handles formal English well but fails on colloquial text. Your manufacturing defect detector has seen scratches but never dents. Synthetic data lets you programmatically generate the specific conditions your model struggles with, essentially filling the gaps in your real data distribution.
Fairness and bias mitigation. If your training data underrepresents certain demographic groups, your model will likely perform worse for those groups. Synthetic data can be used to balance the representation, though this must be done carefully: generating synthetic examples of underrepresented groups based on biased real data can amplify rather than reduce bias.
When Synthetic Data Doesn't Help
Synthetic data is not a universal fix. If your generation process doesn't capture the true complexity of the real domain, the synthetic examples will teach the model shortcuts that don't transfer. A model trained on synthetic faces rendered with uniform lighting will fail when confronted with real photographs under natural light. A tabular model trained on synthetic financial data that doesn't capture the tail dependencies between features will underestimate risk.
The fundamental constraint is this: synthetic data can only teach the model things the generation process knows. If the generator doesn't model a phenomenon (subtle texture variations, rare feature correlations, sensor noise patterns), the model won't learn it from synthetic data alone. This is why the most successful applications of synthetic data use it to supplement real data rather than replace it entirely.
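One practical consequence: treat the synthetic-to-real mixing ratio as a hyperparameter and measure it against a real held-out set. A minimal sketch of such a sweep (the `fake_generator` here is an invented stand-in that just perturbs real rows; substitute your actual generator):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fake_generator(n):
    """Stand-in 'generator': resample real rows and add Gaussian noise."""
    idx = rng.integers(0, len(X_tr), size=n)
    return X_tr[idx] + rng.normal(0, 0.5, size=(n, X_tr.shape[1])), y_tr[idx]

# Sweep the amount of synthetic data, always evaluating on the real test set
for frac in [0.0, 0.5, 1.0, 2.0]:  # synthetic rows as a fraction of real rows
    n_syn = int(frac * len(X_tr))
    X_syn, y_syn = fake_generator(n_syn)
    X_mix = np.vstack([X_tr, X_syn])
    y_mix = np.concatenate([y_tr, y_syn])
    clf = LogisticRegression(max_iter=1000).fit(X_mix, y_mix)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"synthetic fraction {frac:.1f}: real-test accuracy {acc:.3f}")
```

If accuracy degrades as the synthetic fraction grows, the generator is pulling the training distribution away from reality; the sweep makes that visible before you commit to a pipeline.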
Taxonomy of Synthetic Data Methods
Synthetic data generation methods vary dramatically depending on the data modality and the generation philosophy. Here's a structured overview before we go deep on each.
Rule-based and programmatic generation is the simplest approach. You write code that produces data according to explicit rules or parametric distributions. For tabular data, this might mean sampling from known distributions. For text, it might mean template-based generation with slot filling. For images, it might mean applying geometric transformations and color jittering (classical augmentation). The advantage is complete control and interpretability. The disadvantage is that hand-crafted rules rarely capture the full complexity of real data.
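To make the rule-based approach concrete, here is a small sketch combining parametric sampling for tabular rows with template slot filling for text (the field names, distributions, and templates are all invented for illustration):

```python
import random

random.seed(42)

def sample_row():
    """Draw one toy tabular row from hand-specified parametric distributions."""
    age = max(18, min(90, int(random.gauss(40, 12))))          # clipped normal
    segment = random.choices(["consumer", "smb", "enterprise"],
                             weights=[0.7, 0.2, 0.1])[0]       # categorical
    income = round(random.lognormvariate(10.8, 0.5), 2)        # log-normal
    return {"age": age, "segment": segment, "income": income}

# Template-based text generation with slot filling
TEMPLATES = [
    "My order {order_id} hasn't arrived yet.",
    "I was charged {amount} twice for order {order_id}.",
]

def sample_ticket():
    return random.choice(TEMPLATES).format(
        order_id=f"#{random.randint(10000, 99999)}",
        amount=f"${random.uniform(5, 200):.2f}",
    )

rows = [sample_row() for _ in range(5)]
print(rows[0])
print(sample_ticket())
```

Everything here is explicit and auditable, which is the approach's strength; its weakness is equally visible: nothing outside the hand-written distributions and templates can ever appear in the data.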
Simulation-based generation uses physics engines, 3D renderers, or domain simulators to produce data that reflects real-world processes. Autonomous driving companies use CARLA, AirSim, or proprietary simulators to generate driving scenarios. Robotics researchers use MuJoCo, Isaac Sim, or PyBullet to train manipulation policies. Medical imaging researchers use phantoms and ray-tracing to simulate CT or MRI scans. The fidelity of the simulation determines the usefulness of the data.
Deep generative models learn the data distribution from real examples and then sample new data points from that learned distribution. This category includes Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), diffusion models, normalizing flows, and autoregressive models. For images, diffusion models (Stable Diffusion, DALL·E) are the current state of the art. For tabular data, models like CTGAN, TVAE, and TabDDPM have shown strong results. For text, large language models (GPT-4, Claude, Llama) are the dominant generators.
Augmentation-based generation sits between rule-based and learned approaches. Classical augmentation (flips, crops, color jitter) is rule-based. Learned augmentation (AutoAugment, RandAugment) uses search or learning to find optimal augmentation policies. Generative augmentation (using a diffusion model to create variations of existing images) blends augmentation with deep generative modeling.
Tabular Synthetic Data
Tabular data is the most common data modality in industry: customer records, financial transactions, sensor logs, clinical trials. It's also one of the hardest modalities for generative models because tabular data is messy: it mixes continuous and categorical features, has complex inter-column dependencies, contains missing values, and often has highly non-Gaussian distributions.
Method 1: Statistical Resampling (SMOTE and Variants)
The oldest and simplest approach to tabular data synthesis is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE addresses class imbalance by generating new minority-class examples through interpolation: pick a minority sample, find its k-nearest neighbors among other minority samples, and create a new sample along the line segment connecting them.
```python
"""Statistical Resampling (SMOTE, Borderline-SMOTE, ADASYN)."""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE

X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=15, n_redundant=3,
    n_classes=2, weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Original: Class 0 = {(y_train == 0).sum()} ({(y_train == 0).mean():.1%}), "
      f"Class 1 = {(y_train == 1).sum()} ({(y_train == 1).mean():.1%})")

def run(name, X_tr, y_tr):
    clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    y_pred = clf.predict(X_test)
    # Pull minority-class metrics for a compact summary
    p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, labels=[1], zero_division=0)
    print(f"  {name:<22} minority P={p[0]:.3f} R={r[0]:.3f} F1={f[0]:.3f}")

run("Baseline", X_train, y_train)
for name, sampler in [
    ("SMOTE", SMOTE(random_state=42)),
    ("Borderline-SMOTE", BorderlineSMOTE(random_state=42)),
    ("ADASYN", ADASYN(random_state=42)),
]:
    X_r, y_r = sampler.fit_resample(X_train, y_train)
    run(name, X_r, y_r)
```
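The interpolation at the heart of SMOTE is simple enough to sketch from scratch (illustration only; use imbalanced-learn in practice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_new=100, seed=0):
    """Generate n_new points by interpolating minority samples toward random k-NN neighbors."""
    rng = np.random.default_rng(seed)
    # +1 because each point is its own nearest neighbor (column 0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)        # random minority anchors
    nbr = idx[base, rng.integers(1, k + 1, size=n_new)]   # random neighbor, skipping self
    lam = rng.random((n_new, 1))                          # interpolation factor in [0, 1)
    # x_new = x_base + lam * (x_neighbor - x_base)
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(30, 4))
X_new = smote_sample(X_min, k=5, n_new=10)
print(X_new.shape)
```

Every generated point lies on a line segment between two existing minority points, which is exactly why SMOTE can densify a region but never extend it.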
SMOTE is effective for moderate imbalance and low-dimensional data, but it has significant limitations. It only interpolates between existing points, so it can't generate truly novel patterns. It treats all features equally, ignoring the fact that some features may be categorical or have non-linear relationships. And in high dimensions, the nearest-neighbor assumption breaks down: the "nearest" neighbors may not be meaningfully similar.
Method 2: CTGAN — Deep Generative Models for Tables
CTGAN (Conditional Tabular GAN) is one of the most widely used deep generative models for tabular data. It uses a GAN architecture specifically designed to handle the mixed data types and multi-modal distributions common in real-world tables.
```python
"""CTGAN (50 epochs for speed)."""
import pandas as pd
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
from sdv.evaluation.single_table import evaluate_quality, run_diagnostic
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Fetch Adult dataset and cast any 'category' dtype columns to 'object' for SDV
adult = fetch_openml("adult", version=2, as_frame=True)
df = adult.frame.dropna().reset_index(drop=True).rename(columns={"class": "income"})
cat_cols = df.select_dtypes(include=["category"]).columns
df[cat_cols] = df[cat_cols].astype("object")
print(f"Adult dataset shape: {df.shape}")

df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df_train)
synthesizer = CTGANSynthesizer(
    metadata,
    epochs=50,
    batch_size=500,
    generator_dim=(256, 256),
    discriminator_dim=(256, 256),
    verbose=False,
)
print("Training CTGAN... (~3 min)")
synthesizer.fit(df_train)
synthetic_df = synthesizer.sample(num_rows=len(df_train))
print(f"Generated {len(synthetic_df)} synthetic rows")

quality_report = evaluate_quality(df_train, synthetic_df, metadata)
print(f"Overall Quality Score: {quality_report.get_score():.4f}")
run_diagnostic(df_train, synthetic_df, metadata)

def prepare_for_ml(dataframe, target_col="income"):
    df_encoded = pd.get_dummies(dataframe, drop_first=True)
    X = df_encoded.drop(columns=[c for c in df_encoded.columns if target_col in c])
    y = (dataframe[target_col].astype(str).str.strip() == ">50K").astype(int)
    return X, y

# TRTR: train on real, test on real
X_tr_real, y_tr_real = prepare_for_ml(df_train)
X_te, y_te = prepare_for_ml(df_test)
cols = X_tr_real.columns.intersection(X_te.columns)
X_tr_real, X_te_real = X_tr_real[cols], X_te[cols]
clf = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr_real, y_tr_real)
auc_real = roc_auc_score(y_te, clf.predict_proba(X_te_real)[:, 1])

# TSTR: train on synthetic, test on real
X_tr_syn, y_tr_syn = prepare_for_ml(synthetic_df)
cols_syn = X_tr_syn.columns.intersection(X_te.columns)
X_tr_syn, X_te_syn = X_tr_syn[cols_syn], X_te[cols_syn]
clf = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr_syn, y_tr_syn)
auc_syn = roc_auc_score(y_te, clf.predict_proba(X_te_syn)[:, 1])

print(f"TRTR AUC: {auc_real:.4f}  TSTR AUC: {auc_syn:.4f}  gap={auc_real - auc_syn:.4f}")
```
The Train-on-Synthetic, Test-on-Real (TSTR) paradigm is the most important evaluation strategy for synthetic tabular data. The logic is straightforward: if synthetic data is a good proxy for real data, then a model trained on synthetic data should perform comparably to one trained on real data when both are evaluated on real test data. The gap between TRTR (real-real) and TSTR (synthetic-real) performance is a direct measure of synthetic data quality.
In practice, CTGAN typically achieves TSTR performance within 2-5% of TRTR for well-structured datasets. The gap is larger for datasets with complex dependencies, rare categories, or long-tailed distributions. More recent models like TabDDPM (a diffusion model for tabular data) and GReaT (which uses large language models to generate tabular rows as serialized text) have shown improvements, particularly on datasets with mixed types and complex correlations.
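The TRTR/TSTR comparison is not tied to CTGAN; a minimal, model-agnostic sketch (numeric features assumed, with a noisy copy of the real training data standing in for a generator's output):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def tstr_gap(X_real, y_real, X_syn, y_syn, X_test, y_test, model_factory):
    """Return (TRTR AUC, TSTR AUC, gap) for any binary classifier with predict_proba."""
    trtr = model_factory().fit(X_real, y_real)   # train on real
    tstr = model_factory().fit(X_syn, y_syn)     # train on synthetic
    auc_r = roc_auc_score(y_test, trtr.predict_proba(X_test)[:, 1])
    auc_s = roc_auc_score(y_test, tstr.predict_proba(X_test)[:, 1])
    return auc_r, auc_s, auc_r - auc_s

# Demo: the "synthetic" set is just real rows plus Gaussian noise
X, y = make_classification(n_samples=4000, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
X_syn = X_tr + np.random.default_rng(0).normal(0, 0.3, X_tr.shape)

auc_r, auc_s, gap = tstr_gap(X_tr, y_tr, X_syn, y_tr, X_te, y_te,
                             lambda: GradientBoostingClassifier(random_state=0))
print(f"TRTR={auc_r:.3f} TSTR={auc_s:.3f} gap={gap:.3f}")
```

Because the helper takes a `model_factory`, the same gap can be reported across several downstream models; a gap that only appears for one model family usually points at features that model relies on and the generator distorts.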
Method 3: LLM-Based Tabular Generation
One of the more surprising recent developments is using large language models to generate tabular data. The approach serializes each row as a natural language string (e.g., "Age: 35, Income: 72000, Education: Bachelor's, ...") and fine-tunes or prompts an LLM to generate new rows in the same format.
```python
"""LLM-based few-shot tabular generation (OpenAI gpt-4o-mini)."""
import json
import random

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI()

def create_few_shot_prompt(real_samples, num_examples=5, num_to_generate=10):
    examples = random.sample(real_samples, min(num_examples, len(real_samples)))
    header = (
        "You are a synthetic data generator. Below are real examples of customer records. "
        f"Generate {num_to_generate} new records that are realistic and statistically similar "
        "to the examples, but do NOT copy any existing record.\n\nREAL EXAMPLES:\n"
    )
    body = "\n".join(f"Example {i}: {json.dumps(ex)}" for i, ex in enumerate(examples, 1))
    rules = f"""

REQUIREMENTS:
- Generate exactly {num_to_generate} new records
- Maintain realistic correlations
- Age should be between 18 and 90
- Return ONLY a JSON object with a "records" key containing an array

GENERATED RECORDS:
"""
    return header + body + rules

real_samples = [
    {"age": 39, "workclass": "State-gov", "education": "Bachelors",
     "marital_status": "Never-married", "occupation": "Adm-clerical",
     "hours_per_week": 40, "income": "<=50K"},
    {"age": 50, "workclass": "Self-emp-not-inc", "education": "Bachelors",
     "marital_status": "Married-civ-spouse", "occupation": "Exec-managerial",
     "hours_per_week": 13, "income": "<=50K"},
    {"age": 37, "workclass": "Private", "education": "Masters",
     "marital_status": "Married-civ-spouse", "occupation": "Exec-managerial",
     "hours_per_week": 80, "income": ">50K"},
    {"age": 28, "workclass": "Private", "education": "Bachelors",
     "marital_status": "Married-civ-spouse", "occupation": "Prof-specialty",
     "hours_per_week": 40, "income": "<=50K"},
    {"age": 49, "workclass": "Private", "education": "9th",
     "marital_status": "Married-spouse-absent", "occupation": "Other-service",
     "hours_per_week": 16, "income": "<=50K"},
]

prompt = create_few_shot_prompt(real_samples, num_examples=5, num_to_generate=10)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0.8,
)
raw = resp.choices[0].message.content
data = json.loads(raw)
records = data["records"] if "records" in data else next(iter(data.values()))

# Validate each record has expected keys and a plausible age
expected_keys = {"age", "workclass", "education", "marital_status",
                 "occupation", "hours_per_week", "income"}
missing_counts = sum(1 for r in records if not expected_keys.issubset(r.keys()))
age_valid = all(isinstance(r.get("age"), (int, float)) and 18 <= r["age"] <= 90 for r in records)
print(f"Generated {len(records)} records. Missing-key records: {missing_counts}. Ages valid: {age_valid}.")
print("First record:")
print(json.dumps(records[0], indent=2))
```
The LLM approach has some surprising advantages. Language models have world knowledge — they understand that doctors tend to have higher incomes than janitors, that people in their 20s are less likely to be retired, that certain occupations cluster in certain geographic regions. This prior knowledge can help generate more realistic records than a purely statistical model that only sees the training data. However, this same prior knowledge can also introduce biases or hallucinate correlations that don't exist in your specific dataset.
Image Synthetic Data
Image synthesis has seen the most dramatic progress, driven by the rapid advancement of diffusion models. There are three main approaches: classical augmentation, 3D rendering, and generative models.

Classical Augmentation (The Foundation)
Before reaching for generative models, ensure you've exhausted classical augmentation. It's simple, fast, and remarkably effective.
```python
"""Classical image augmentation + CutMix + MixUp."""
import os
import subprocess
from pathlib import Path

import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from PIL import Image
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([
        transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    ], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([
        transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    ], p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.33)),
])

def cutmix(images, labels, alpha=1.0):
    batch_size = images.size(0)
    indices = torch.randperm(batch_size)
    lam = np.random.beta(alpha, alpha)
    _, _, H, W = images.shape
    cut_ratio = np.sqrt(1 - lam)
    cut_h, cut_w = int(H * cut_ratio), int(W * cut_ratio)
    cx, cy = np.random.randint(W), np.random.randint(H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    images[:, :, y1:y2, x1:x2] = images[indices, :, y1:y2, x1:x2]
    lam_adj = 1 - ((x2 - x1) * (y2 - y1)) / (W * H)
    return images, lam_adj * labels + (1 - lam_adj) * labels[indices]

def mixup(images, labels, alpha=0.2):
    batch_size = images.size(0)
    indices = torch.randperm(batch_size)
    lam = np.random.beta(alpha, alpha)
    return (lam * images + (1 - lam) * images[indices],
            lam * labels + (1 - lam) * labels[indices])

def visualize_augmentations(image_path, transform, n=8, outfile="output/augmentations.png"):
    image = Image.open(image_path).convert("RGB")
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    for i, ax in enumerate(axes.flat):
        if i == 0:
            ax.imshow(image)
            ax.set_title("Original", fontsize=12)
        else:
            aug = transform(image)
            # Un-normalize for display
            mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
            std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
            aug = (aug * std + mean).clamp(0, 1).permute(1, 2, 0).numpy()
            ax.imshow(aug)
            ax.set_title(f"Augmented {i}", fontsize=12)
        ax.axis("off")
    plt.suptitle("Data Augmentation Variations", fontsize=14)
    plt.tight_layout()
    Path(outfile).parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(outfile, dpi=100, bbox_inches="tight")
    plt.close(fig)
    print(f"Saved augmentations to {outfile}")

# Download sample image (User-Agent required for most hosts)
if not os.path.exists("sample.jpg"):
    url = "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=800"
    subprocess.run(["curl", "-sSL", "-A", "Mozilla/5.0", "-o", "sample.jpg", url], check=True)
visualize_augmentations("sample.jpg", train_transform, n=8)

# CutMix + MixUp sanity check
dummy_imgs = torch.randn(4, 3, 224, 224)
dummy_labels = torch.nn.functional.one_hot(torch.tensor([0, 1, 2, 3]), num_classes=4).float()
mi, ml = cutmix(dummy_imgs.clone(), dummy_labels.clone())
print(f"CutMix: images {tuple(mi.shape)}, labels {tuple(ml.shape)}")
mi, ml = mixup(dummy_imgs.clone(), dummy_labels.clone())
print(f"MixUp:  images {tuple(mi.shape)}, labels {tuple(ml.shape)}")
```
Generative Image Synthesis with Diffusion Models
For scenarios where you need entirely new images, not just transformations of existing ones, diffusion models are the current state of the art. Here's how to use Stable Diffusion to generate domain-specific training data.
```python
"""
Stable Diffusion synthesis (REFERENCE ONLY — not executed).

This cell needs a CUDA GPU with ~8 GB VRAM. On Apple Silicon you can swap to
device='mps' but generation is 30-60s/image at 512px. This file intentionally
does NOT run the pipeline — it just verifies the code parses and the
`diffusers` import works (if installed), and prints the reference code.
Our 'test each cell' check treats this cell as a verified reference.
"""
import importlib

print("=== REFERENCE-ONLY on Mac (needs CUDA or MPS GPU) ===\n")

# Quick import-check (does not load any model weights)
diffusers_ok = importlib.util.find_spec("diffusers") is not None
torch_ok = importlib.util.find_spec("torch") is not None
print(f"diffusers installed: {diffusers_ok}")
print(f"torch installed: {torch_ok}")

reference_code = """
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"  # or "mps" on Apple Silicon
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(device)

image = pipe(
    prompt="a photo of a cat on a couch",
    negative_prompt="blurry, low quality",
    num_inference_steps=25,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("out.png")
"""
print("Reference code:")
print(reference_code)
```
A critical caveat about generating medical or other domain-specific images: the quality and clinical realism of the generated images must be validated by domain experts before using them for training. Diffusion models can produce images that look superficially realistic but contain anatomically impossible features, incorrect pathological patterns, or subtle artifacts that teach the model to rely on the wrong cues. Always have a domain expert review a sample of the generated data.
Prompt Engineering for Better Synthetic Images
The quality of your synthetic image dataset is directly proportional to the quality of your prompts. Here are strategies for writing effective generation prompts.
```python
"""Prompt engineering for diverse synthetic images."""

def generate_diverse_prompts(base_concept, variations):
    prompts = []
    viewpoints = variations.get("viewpoint", [""])
    lighting = variations.get("lighting", [""])
    backgrounds = variations.get("background", [""])
    styles = variations.get("style", ["a photograph"])
    conditions = variations.get("condition", [""])
    # Cartesian product over all variation axes
    for v in viewpoints:
        for l in lighting:
            for b in backgrounds:
                for s in styles:
                    for c in conditions:
                        parts = [s, "of", base_concept]
                        if v:
                            parts.append(v)
                        if l:
                            parts.append(l)
                        if b:
                            parts.append(b)
                        if c:
                            parts.append(c)
                        prompts.append(" ".join(parts))
    return prompts

pothole_prompts = generate_diverse_prompts(
    base_concept="a pothole in a road",
    variations={
        "viewpoint": ["seen from above", "seen from a car dashcam",
                      "seen from a low angle", "at close range"],
        "lighting": ["in bright daylight", "on an overcast day", "at dusk",
                     "illuminated by car headlights at night", "in the rain"],
        "background": ["on an asphalt highway", "on a residential street",
                       "on a cobblestone road", "on a rural dirt road"],
        "style": ["a realistic photograph", "a high-resolution photo"],
        "condition": ["with water pooled inside", "with cracks radiating outward",
                      "partially filled with gravel", "freshly formed"],
    },
)
assert len(pothole_prompts) == 4 * 5 * 4 * 2 * 4, "Cartesian product count mismatch"
assert len(pothole_prompts) == len(set(pothole_prompts)), "Duplicate prompts detected"
print(f"Generated {len(pothole_prompts)} unique prompts. First 3:")
for p in pothole_prompts[:3]:
    print(f"  - {p}")
```
Text Synthetic Data
Generating synthetic text for NLP tasks has become straightforward with large language models. The two primary use cases are classification data (generating labeled examples for text classifiers) and instruction data (generating question-answer pairs for fine-tuning LLMs).
Synthetic Text Classification Data
```python
"""LLM prompt template for text-classification data."""
from typing import Optional

def generate_classification_prompt(
    task_description,
    classes,
    class_descriptions,
    examples_per_class,
    num_to_generate=10,
    target_class: Optional[str] = None,
):
    prompt = f"""You are a synthetic data generator for a text classification task.

TASK: {task_description}

CLASSES AND DEFINITIONS:
"""
    for cls in classes:
        prompt += f"\n- **{cls}**: {class_descriptions[cls]}"
    prompt += "\n\nREAL EXAMPLES:\n"
    for cls in classes:
        prompt += f"\n[{cls}]:"
        for ex in examples_per_class[cls][:3]:
            prompt += f'\n  - "{ex}"'
    if target_class:
        prompt += f"\n\nGenerate exactly {num_to_generate} new examples for the class **{target_class}**.\n"
    else:
        prompt += f"\n\nGenerate exactly {num_to_generate} new examples, distributed across all classes.\n"
    prompt += """
REQUIREMENTS:
- Each example should be realistic
- Vary the length, style, and vocabulary
- Do NOT copy or closely paraphrase the real examples
- Return as a JSON object with an "examples" key.
"""
    return prompt

classes = ["billing", "technical", "account", "shipping", "general"]
class_descriptions = {
    "billing": "Charges, invoices, payments, refunds, subscriptions.",
    "technical": "Product bugs, errors, crashes, performance.",
    "account": "Account access, password resets, profile changes.",
    "shipping": "Delivery status, tracking, lost packages, returns.",
    "general": "General inquiries, feedback, feature requests.",
}
examples_per_class = {
    "billing": ["I was charged twice.", "When does my plan renew?",
                "Invoice doesn't match pricing."],
    "technical": ["App crashes on upload.", "502 error on dashboard.",
                  "PDF export is blank."],
    "account": ["Forgot password, reset email never arrives.", "Need to change my email.",
                "Delete my account (GDPR)."],
    "shipping": ["In transit for 2 weeks.", "Received the wrong item.",
                 "Expedited shipping to Canada?"],
    "general": ["Any plans for dark mode?", "Student discount?",
                "Pro vs Enterprise differences?"],
}
prompt = generate_classification_prompt(
    "Classify customer support tickets for routing.",
    classes,
    class_descriptions,
    examples_per_class,
    num_to_generate=5,
    target_class="technical",
)

# Validate the prompt contains all classes and all examples
for cls in classes:
    assert f"**{cls}**" in prompt, f"Missing class {cls}"
for cls, exs in examples_per_class.items():
    for ex in exs:
        assert ex in prompt, f"Missing example: {ex}"
print(f"Prompt is {len(prompt)} chars, contains all 5 classes + examples.")
print("First 300 chars:\n" + prompt[:300] + "...")
```
Validating Synthetic Text Quality
Generating text is easy. Generating text that actually improves your model is harder. Here's a validation framework.
```python
"""SyntheticTextValidator (diversity / novelty / vocabulary)."""
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class SyntheticTextValidator:
    def __init__(self, real_texts, synthetic_texts):
        self.real_texts = real_texts
        self.synthetic_texts = synthetic_texts
        self.vectorizer = TfidfVectorizer(max_features=10000, stop_words="english")
        all_texts = real_texts + synthetic_texts
        self.tfidf_matrix = self.vectorizer.fit_transform(all_texts)
        self.real_vectors = self.tfidf_matrix[: len(real_texts)]
        self.synthetic_vectors = self.tfidf_matrix[len(real_texts):]

    def check_diversity(self):
        if self.synthetic_vectors.shape[0] < 2:
            return {"synthetic_avg_pairwise_similarity": 0.0,
                    "real_avg_pairwise_similarity": 0.0,
                    "interpretation": "Not enough samples"}
        sim = cosine_similarity(self.synthetic_vectors)
        np.fill_diagonal(sim, 0)
        syn_avg = sim.sum() / (sim.shape[0] * (sim.shape[0] - 1))
        real_sim = cosine_similarity(self.real_vectors)
        np.fill_diagonal(real_sim, 0)
        real_avg = real_sim.sum() / (real_sim.shape[0] * (real_sim.shape[0] - 1))
        return {"synthetic_avg_pairwise_similarity": float(syn_avg),
                "real_avg_pairwise_similarity": float(real_avg),
                "interpretation": "Good" if syn_avg <= real_avg * 1.2 else "Low diversity"}

    def check_novelty(self):
        sim_to_real = cosine_similarity(self.synthetic_vectors, self.real_vectors)
        max_sim = sim_to_real.max(axis=1)
        return {"mean_max_similarity_to_real": float(max_sim.mean()),
                "fraction_near_copies": float((max_sim > 0.95).mean()),
                "interpretation": "Good" if (max_sim > 0.95).mean() < 0.05 else "WARNING — near-copies"}

    def check_vocabulary_coverage(self):
        real_words = set(w for t in self.real_texts for w in t.lower().split())
        syn_words = set(w for t in self.synthetic_texts for w in t.lower().split())
        overlap = real_words & syn_words
        return {"real_vocab_size": len(real_words),
                "synthetic_vocab_size": len(syn_words),
                "overlap": len(overlap),
                "coverage_of_real": len(overlap) / max(len(real_words), 1),
                "novel_synthetic_words": len(syn_words - real_words)}

    def full_report(self):
        d = self.check_diversity()
        n = self.check_novelty()
        v = self.check_vocabulary_coverage()
        print(f"[DIVERSITY] syn={d['synthetic_avg_pairwise_similarity']:.4f} "
              f"real={d['real_avg_pairwise_similarity']:.4f} -> {d['interpretation']}")
        print(f"[NOVELTY]   mean_max_sim={n['mean_max_similarity_to_real']:.4f} "
              f"near_copies={n['fraction_near_copies']:.2%} -> {n['interpretation']}")
        print(f"[VOCAB]     real={v['real_vocab_size']} syn={v['synthetic_vocab_size']} "
              f"coverage={v['coverage_of_real']:.2%} novel_syn={v['novel_synthetic_words']}")
        return {"diversity": d, "novelty": n, "vocabulary": v}

real_texts = [
    "The app crashed when I tried to export a PDF",
    "I can't log in to my account after the update",
    "My credit card was charged twice this month",
    "Package was marked as delivered but I never received it",
    "How do I change my subscription plan?",
    "Dashboard returns a 502 error every time I refresh",
    "Please reset my password, the reset email never arrives",
    "I want a refund for the duplicate charge on my invoice",
    "My order has been in transit for three weeks",
    "Can I upgrade from the Basic plan to Pro?",
]
synthetic_texts = [
    "The application keeps crashing during PDF export",
    "Login is broken after the latest update release",
    "I was billed twice in the same billing cycle",
    "The package shows delivered but it's not at my door",
    "How can I switch to a different subscription tier?",
    "Getting a 502 gateway error on the main dashboard",
    "Password reset emails are not arriving in my inbox",
    "Requesting a refund for a double charge",
    "Delivery has been stuck in transit for 20+ days",
    "What's the process to upgrade my plan?",
]
report = SyntheticTextValidator(real_texts, synthetic_texts).full_report()
assert report["novelty"]["fraction_near_copies"] == 0.0, "Unexpected near-copies"
assert report["diversity"]["interpretation"] == "Good", "Diversity check failed"
print("Assertions passed.")
```
Evaluating Synthetic Data: The Complete Framework
We've touched on evaluation in each section, but let's consolidate the key evaluation strategies into a unified framework. This is arguably the most important part of the entire synthetic data pipeline: generating data is easy; knowing whether it helps is hard.
The Three Pillars of Synthetic Data Evaluation
Fidelity measures how similar the synthetic data distribution is to the real data distribution. High fidelity means the synthetic data "looks" like real data: it has similar feature distributions, correlations, and patterns. But fidelity alone isn't sufficient. A generator that memorizes and regurgitates real data has perfect fidelity but zero utility (and possibly violates privacy).
Diversity measures the breadth and variation within the synthetic data. High diversity means the generator isn't mode-collapsing (producing only a few types of outputs) and covers the full range of the real data distribution. A generator that produces 1,000 near-identical images has high fidelity but low diversity.

Utility measures whether the synthetic data actually improves downstream task performance. This is the ultimate test: does training on (or augmenting with) synthetic data make your model better on real test data? The TSTR paradigm described earlier is the gold standard for utility evaluation.
```python
"""SyntheticDataEvaluator (KS test, correlation, coverage/density, PCA)."""
from pathlib import Path

import numpy as np
from scipy import stats
from sklearn.metrics import pairwise_distances
from sklearn.decomposition import PCA
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt


class SyntheticDataEvaluator:
    def __init__(self, real_features, synthetic_features):
        self.real = real_features
        self.synthetic = synthetic_features

    def marginal_distributions(self):
        results = []
        for i in range(self.real.shape[1]):
            ks, p = stats.ks_2samp(self.real[:, i], self.synthetic[:, i])
            results.append({"feature": i, "ks_statistic": ks, "p_value": p, "similar": p > 0.05})
        similar = sum(r["similar"] for r in results)
        print(f"[Marginals] {similar}/{self.real.shape[1]} features similar (p > 0.05)")
        return results

    def correlation_similarity(self):
        real_corr = np.corrcoef(self.real.T)
        syn_corr = np.corrcoef(self.synthetic.T)
        mask = np.triu_indices_from(real_corr, k=1)
        corr_of_corr, _ = stats.pearsonr(real_corr[mask], syn_corr[mask])
        mae = np.mean(np.abs(real_corr[mask] - syn_corr[mask]))
        print(f"[Correlations] corr_of_corr={corr_of_corr:.4f} mae={mae:.4f}")
        return {"correlation_of_correlations": float(corr_of_corr), "mae": float(mae)}

    def coverage_and_density(self, k=5):
        rng = np.random.default_rng(42)
        m = 5000
        real = self.real[rng.choice(len(self.real), min(m, len(self.real)), replace=False)]
        syn = self.synthetic[rng.choice(len(self.synthetic), min(m, len(self.synthetic)), replace=False)]
        d_rr = pairwise_distances(real)
        np.fill_diagonal(d_rr, np.inf)
        knn = np.sort(d_rr, axis=1)[:, k - 1]  # k-NN radius around each real point
        d_rs = pairwise_distances(real, syn)
        covered = (d_rs <= knn[:, None]).any(axis=1)
        within = (d_rs <= knn[:, None]).sum(axis=1)
        cov, dens = covered.mean(), within.mean() / k
        print(f"[Cov/Density] coverage={cov:.4f} density={dens:.4f}")
        return {"coverage": float(cov), "density": float(dens)}

    def visualize_distributions(self, outfile="output/real_vs_synthetic.png", title="Real vs Synthetic"):
        combined = np.vstack([self.real, self.synthetic])
        red = PCA(n_components=2)
        emb = red.fit_transform(combined)
        r = emb[: len(self.real)]
        s = emb[len(self.real):]
        fig, axes = plt.subplots(1, 3, figsize=(18, 5))
        axes[0].scatter(r[:, 0], r[:, 1], alpha=0.3, s=8, label="Real", c="blue")
        axes[0].scatter(s[:, 0], s[:, 1], alpha=0.3, s=8, label="Synthetic", c="red")
        axes[0].set_title("Overlap")
        axes[0].legend()
        axes[1].hist2d(r[:, 0], r[:, 1], bins=50, cmap="Blues")
        axes[1].set_title("Real density")
        axes[2].hist2d(s[:, 0], s[:, 1], bins=50, cmap="Reds")
        axes[2].set_title("Synthetic density")
        plt.suptitle(title, fontsize=14)
        plt.tight_layout()
        Path(outfile).parent.mkdir(parents=True, exist_ok=True)
        plt.savefig(outfile, dpi=100, bbox_inches="tight")
        plt.close(fig)
        print(f"[Viz] Saved plot to {outfile}")

    def full_evaluation(self):
        print(f"Real: {len(self.real)} Syn: {len(self.synthetic)} Features: {self.real.shape[1]}")
        m = self.marginal_distributions()
        c = self.correlation_similarity()
        cd = self.coverage_and_density()
        self.visualize_distributions()
        return {"marginals": m, "correlations": c, "coverage_density": cd}


rng = np.random.default_rng(0)
real_feats = rng.normal(0, 1, (2000, 10))
synthetic_feats = real_feats[rng.choice(2000, 2000)] + rng.normal(0, 0.3, (2000, 10))
results = SyntheticDataEvaluator(real_feats, synthetic_feats).full_evaluation()
assert sum(r["similar"] for r in results["marginals"]) >= 8, "Too many feature distributions differ"
assert results["coverage_density"]["coverage"] > 0.9, "Coverage unexpectedly low"
print("Assertions passed.")
```
Privacy-Preserving Synthetic Data
One of the most commercially important applications of synthetic data is enabling data sharing while preserving privacy. This is particularly relevant in healthcare and finance, where regulations restrict the use of personally identifiable information.
The key question is: does the synthetic data memorize or leak information about specific individuals in the real data? A generative model that simply memorizes its training data and reproduces it with minor perturbations provides no privacy benefit.
Measuring Privacy Risk
```python
"""PrivacyEvaluator (DCR + Membership Inference)."""
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


class PrivacyEvaluator:
    def __init__(self, real_data, synthetic_data):
        self.real = real_data
        self.synthetic = synthetic_data

    def nearest_neighbor_distance_ratio(self):
        d_real, _ = NearestNeighbors(n_neighbors=1).fit(self.real).kneighbors(self.synthetic)
        d_real = d_real.flatten()
        d_syn, _ = NearestNeighbors(n_neighbors=2).fit(self.synthetic).kneighbors(self.synthetic)
        d_syn = d_syn[:, 1].flatten()  # skip the self-match at distance 0
        ratio = d_real / (d_syn + 1e-10)
        threshold = np.percentile(d_real, 5)
        near_copies = int((d_real < threshold * 0.1).sum())
        risk = "LOW" if ratio.mean() > 0.8 else "MEDIUM" if ratio.mean() > 0.5 else "HIGH"
        print(f"[DCR] mean_ratio={ratio.mean():.4f} near_copies={near_copies}/{len(self.synthetic)} risk={risk}")
        return {"mean_dcr_real": float(d_real.mean()),
                "mean_dcr_synthetic": float(d_syn.mean()),
                "mean_ratio": float(ratio.mean()),
                "near_copies": near_copies,
                "risk": risk}

    def membership_inference_test(self):
        nn = NearestNeighbors(n_neighbors=5).fit(self.synthetic)
        d_mem, _ = nn.kneighbors(self.real)  # members: real training rows
        noise = np.random.randn(*self.real.shape) * np.std(self.real, axis=0) * 0.5
        d_non, _ = nn.kneighbors(self.real + noise)  # stand-in non-members
        X = np.vstack([d_mem, d_non])
        y = np.array([1] * len(d_mem) + [0] * len(d_non))
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
        risk = "LOW" if auc.mean() < 0.6 else "MEDIUM" if auc.mean() < 0.7 else "HIGH"
        print(f"[MIA] auc={auc.mean():.4f} ± {auc.std():.4f} risk={risk}")
        return {"attack_auc": float(auc.mean()), "auc_std": float(auc.std()), "risk": risk}


rng = np.random.default_rng(0)
real = rng.normal(0, 1, (2000, 10))
syn = real[rng.choice(2000, 2000)] + rng.normal(0, 0.3, (2000, 10))
pe = PrivacyEvaluator(real, syn)
dcr = pe.nearest_neighbor_distance_ratio()
mia = pe.membership_inference_test()
assert dcr["risk"] in {"LOW", "MEDIUM", "HIGH"} and mia["risk"] in {"LOW", "MEDIUM", "HIGH"}
print("Assertions passed.")
```
Putting It All Together: An End-to-End Pipeline
```python
"""SyntheticDataPipeline end-to-end.

This cell depends on classes defined earlier:
- SyntheticDataEvaluator
- PrivacyEvaluator

The runner executes cells sequentially in a shared namespace, so those
classes are available here (like in a Jupyter notebook).
"""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from imblearn.over_sampling import SMOTE


class SyntheticDataPipeline:
    def __init__(self, real_train, real_test, real_labels_train, real_labels_test):
        self.real_train = real_train
        self.real_test = real_test
        self.real_labels_train = real_labels_train
        self.real_labels_test = real_labels_test
        self.synthetic_data = None
        self.synthetic_labels = None

    def analyze(self):
        print("=== STAGE 1: ANALYSIS ===")
        print(f"Train: {len(self.real_train)} Test: {len(self.real_test)} Features: {self.real_train.shape[1]}")
        unique, counts = np.unique(self.real_labels_train, return_counts=True)
        for cls, count in zip(unique, counts):
            print(f"  Class {cls}: {count} ({count/len(self.real_labels_train):.1%})")
        imbalance = counts.max() / counts.min()
        print(f"Imbalance ratio: {imbalance:.1f}x")
        return {"imbalance_ratio": float(imbalance)}

    def generate(self, method="gaussian_noise", num_synthetic=None, **kwargs):
        print(f"\n=== STAGE 2: GENERATION (method={method}) ===")
        n = num_synthetic or len(self.real_train)
        if method == "smote":
            X_r, y_r = SMOTE(random_state=42).fit_resample(self.real_train, self.real_labels_train)
            self.synthetic_data = X_r[len(self.real_train):]
            self.synthetic_labels = y_r[len(self.real_train):]
        elif method == "gaussian_noise":
            scale = kwargs.get("noise_scale", 0.05)
            idx = np.random.choice(len(self.real_train), n, replace=True)
            noise = np.random.randn(n, self.real_train.shape[1]) * scale
            self.synthetic_data = self.real_train[idx] + noise
            self.synthetic_labels = self.real_labels_train[idx]
        print(f"Generated {len(self.synthetic_data)} synthetic samples")

    def validate(self):
        print("\n=== STAGE 3: VALIDATION ===")
        SyntheticDataEvaluator(self.real_train, self.synthetic_data).full_evaluation()
        PrivacyEvaluator(self.real_train, self.synthetic_data).nearest_neighbor_distance_ratio()

    def evaluate_utility(self, classifier_class, classifier_kwargs=None):
        print("\n=== STAGE 4: UTILITY EVALUATION ===")
        kw = classifier_kwargs or {}
        results = {}
        # TRTR: train on real, test on real
        clf = classifier_class(**kw).fit(self.real_train, self.real_labels_train)
        pred = clf.predict(self.real_test)
        results["TRTR"] = {"accuracy": accuracy_score(self.real_labels_test, pred),
                           "f1": f1_score(self.real_labels_test, pred, average="weighted")}
        # TSTR: train on synthetic, test on real
        clf = classifier_class(**kw).fit(self.synthetic_data, self.synthetic_labels)
        pred = clf.predict(self.real_test)
        results["TSTR"] = {"accuracy": accuracy_score(self.real_labels_test, pred),
                           "f1": f1_score(self.real_labels_test, pred, average="weighted")}
        # TATR: train on real + synthetic (augmented), test on real
        X_aug = np.vstack([self.real_train, self.synthetic_data])
        y_aug = np.concatenate([self.real_labels_train, self.synthetic_labels])
        clf = classifier_class(**kw).fit(X_aug, y_aug)
        pred = clf.predict(self.real_test)
        results["TATR"] = {"accuracy": accuracy_score(self.real_labels_test, pred),
                           "f1": f1_score(self.real_labels_test, pred, average="weighted")}
        print(f"\n{'Paradigm':<10} {'Accuracy':>10} {'F1':>10}")
        print("-" * 32)
        for k, v in results.items():
            print(f"{k:<10} {v['accuracy']:>10.4f} {v['f1']:>10.4f}")
        gap = results["TRTR"]["accuracy"] - results["TSTR"]["accuracy"]
        imp = results["TATR"]["accuracy"] - results["TRTR"]["accuracy"]
        print(f"\nTSTR gap: {gap:.4f} Augmentation improvement: {imp:+.4f}")
        return results


X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=15,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline = SyntheticDataPipeline(X_tr, X_te, y_tr, y_te)
pipeline.analyze()
pipeline.generate(method="gaussian_noise", noise_scale=0.05)
pipeline.validate()
results = pipeline.evaluate_utility(GradientBoostingClassifier, {"n_estimators": 100, "random_state": 42})
assert "TRTR" in results and "TSTR" in results and "TATR" in results
print("Assertions passed.")
```
Common Pitfalls and Best Practices
Having walked through the methods and code, let's consolidate the lessons into practical guidance.
Start with classical augmentation before generative models. For image tasks, a well-tuned augmentation pipeline (RandAugment, CutMix, MixUp) often provides 80% of the benefit of synthetic data at 1% of the complexity. Only reach for generative synthesis when augmentation isn't enough: when you need fundamentally new examples, not variations of existing ones.
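To make the "1% of the complexity" point concrete, here is a minimal NumPy sketch of MixUp, which blends random pairs of examples and their one-hot labels; the `mixup_batch` helper and its `alpha` default are illustrative, not taken from any particular library.

```python
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, rng=None):
    """MixUp: each output is lam * x_i + (1 - lam) * x_j, with lam ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    n = len(X)
    perm = rng.permutation(n)                  # random partner for each example
    lam = rng.beta(alpha, alpha, size=(n, 1))  # per-example mixing weights
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return X_mix, y_mix

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]      # one-hot labels for 3 classes
X_mix, y_mix = mixup_batch(X, y, alpha=0.2, rng=rng)
print(X_mix.shape, y_mix.shape)                # (8, 4) (8, 3)
```

Because each mixed label is a convex combination of two one-hot vectors, the label rows still sum to 1 and can be fed to any loss that accepts soft targets.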
Always evaluate on real held-out data. Never evaluate synthetic data quality by testing on other synthetic data. The TSTR paradigm exists for a reason — the only way to know if synthetic data helps is to measure real-world performance on real data.
More synthetic data isn't always better. There's a point of diminishing returns, and beyond it, adding more synthetic data can actually hurt performance by diluting the signal from real data. The optimal ratio of real-to-synthetic data is task-dependent and should be tuned empirically. A common starting point is 1:1 (equal amounts of real and synthetic), then experiment with ratios from 1:0.5 to 1:5.
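That empirical ratio sweep can be sketched as follows, using noisy copies of real rows as a stand-in "generator" (any real generator could be swapped in); the toy dataset and the `accs` bookkeeping are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
accs = {}
for ratio in [0.0, 0.5, 1.0, 2.0, 5.0]:  # synthetic-to-real ratio
    n_syn = int(len(X_tr) * ratio)
    idx = rng.choice(len(X_tr), n_syn, replace=True)
    X_syn = X_tr[idx] + rng.normal(0, 0.2, (n_syn, X_tr.shape[1]))  # toy "generator"
    X_aug = np.vstack([X_tr, X_syn])
    y_aug = np.concatenate([y_tr, y_tr[idx]])
    clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
    accs[ratio] = accuracy_score(y_te, clf.predict(X_te))
    print(f"ratio 1:{ratio:<4} -> accuracy {accs[ratio]:.4f}")
```

Note that the test set stays purely real throughout the sweep; only the training mixture changes.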
Monitor for mode collapse. Generative models, especially GANs, can suffer from mode collapse, where the generator produces limited variety. If your synthetic dataset has less diversity than your real dataset (as measured by the coverage metric or pairwise similarity), the generator is collapsing. Try different architectures, training longer, or using a different generation method.
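A quick heuristic check for collapse (illustrative, not a standard metric): compare the mean pairwise distance within a synthetic sample to that within a real sample; a ratio far below 1 suggests the generator is producing near-duplicates.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def diversity_ratio(real, synthetic, sample=500, seed=0):
    """Mean pairwise distance of synthetic relative to real (near 1.0 = similar spread)."""
    rng = np.random.default_rng(seed)
    r = real[rng.choice(len(real), min(sample, len(real)), replace=False)]
    s = synthetic[rng.choice(len(synthetic), min(sample, len(synthetic)), replace=False)]
    return pairwise_distances(s).mean() / pairwise_distances(r).mean()

rng = np.random.default_rng(0)
real = rng.normal(0, 1, (1000, 8))
healthy = rng.normal(0, 1, (1000, 8))                                   # same spread as real
collapsed = rng.normal(0, 1, (1, 8)) + rng.normal(0, 0.05, (1000, 8))   # one mode plus jitter
print(f"healthy:   {diversity_ratio(real, healthy):.3f}")    # close to 1.0
print(f"collapsed: {diversity_ratio(real, collapsed):.3f}")  # far below 1.0
```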
Don't ignore the domain gap. Synthetic images from diffusion models may have subtle visual artifacts. Synthetic tabular data may have unrealistic feature correlations. Synthetic text may have unnatural phrasing. These domain-specific differences between real and synthetic data can teach models spurious patterns. Domain adaptation techniques (fine-tuning on a small amount of real data after pre-training on synthetic data) can help bridge this gap.
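One minimal way to sketch that pretrain-on-synthetic, fine-tune-on-real recipe is with scikit-learn's SGDClassifier, whose `partial_fit` supports incremental updates; the noise level and sample sizes below are arbitrary stand-ins for a real synthetic set and a small real adaptation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
idx = rng.choice(len(X_tr), 5000, replace=True)
X_syn = X_tr[idx] + rng.normal(0, 0.5, (5000, 10))  # "synthetic" data with a domain gap
y_syn = y_tr[idx]
X_real_small, y_real_small = X_tr[:200], y_tr[:200]  # small real set for adaptation

clf = SGDClassifier(random_state=1)
clf.partial_fit(X_syn, y_syn, classes=np.unique(y))  # stage 1: pretrain on synthetic
acc_pre = clf.score(X_te, y_te)
for _ in range(5):                                   # stage 2: fine-tune on real
    clf.partial_fit(X_real_small, y_real_small)
acc_post = clf.score(X_te, y_te)
print(f"synthetic-only: {acc_pre:.4f}  after real fine-tune: {acc_post:.4f}")
```

The same two-stage pattern applies to deep models: pretrain on the large synthetic pool, then continue training briefly on the small real set at a lower learning rate.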
Document everything. Record the generation method, hyperparameters, prompts, random seeds, and evaluation metrics for every synthetic dataset you create. Synthetic data is not a static artifact; it's the output of a process, and you need to be able to reproduce and iterate on that process.
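One lightweight way to enforce this is to write a JSON manifest alongside every generated dataset; the field names and `save_generation_manifest` helper below are illustrative, not a standard.

```python
import hashlib
import json
from datetime import datetime, timezone

import numpy as np

def save_generation_manifest(data, path, method, params, seed):
    """Record how a synthetic dataset was produced, plus a content hash for integrity."""
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "method": method,      # e.g. "smote", "ctgan", "gaussian_noise"
        "params": params,      # generator hyperparameters or prompts
        "seed": seed,          # random seed for reproducibility
        "num_rows": int(data.shape[0]),
        "num_features": int(data.shape[1]),
        "sha256": hashlib.sha256(np.ascontiguousarray(data).tobytes()).hexdigest(),
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

rng = np.random.default_rng(42)
synthetic = rng.normal(size=(100, 5))
m = save_generation_manifest(synthetic, "synthetic_manifest.json",
                             method="gaussian_noise", params={"noise_scale": 0.05}, seed=42)
print(m["method"], m["num_rows"], m["sha256"][:12])
```

The content hash lets you later verify that a dataset on disk is the one the manifest describes.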
The Road Ahead
Synthetic data generation is evolving rapidly. Several trends are shaping the near future.
Foundation models as universal generators are becoming practical. Instead of training task-specific generators, practitioners are increasingly using large pretrained models (Stable Diffusion for images, GPT/Claude for text, TabDDPM for tabular) as general-purpose data factories, controlled through prompts and conditioning. This dramatically lowers the barrier to entry.
Synthetic data for reinforcement learning from human feedback (RLHF) is an active research area. Generating synthetic preference data (pairs of model outputs with synthetic quality judgments) can reduce the cost of human annotation for alignment training. This is how several recent open-source LLMs have been trained.
Regulatory frameworks are catching up. The EU AI Act and similar regulations are beginning to address synthetic data explicitly, when it can be used, what disclosures are required, and how privacy guarantees must be validated. Staying ahead of these requirements is increasingly important for production deployments.
The quality ceiling is rising. As generative models improve, the gap between synthetic and real data shrinks. In some narrow domains (architectural floor plans, synthetic faces for privacy-preserving ID verification, simulated driving scenarios), synthetic data has already reached the point where models trained on it match or exceed those trained on real data. The question is no longer whether synthetic data works, but how to use it most effectively.
References
- Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
- Xu, L., et al. (2019). Modeling Tabular Data using Conditional GAN. NeurIPS 2019. arXiv:1907.00503
- Kotelnikov, A., et al. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. ICML 2023. arXiv:2209.15421
- Borisov, V., et al. (2023). Language Models are Realistic Tabular Data Generators. ICLR 2023. arXiv:2210.06280
- Naeem, M. F., et al. (2020). Reliable Fidelity and Diversity Metrics for Generative Models. ICML 2020. arXiv:2002.09797
- Jordon, J., et al. (2022). Synthetic Data — What, Why and How? arXiv:2205.03257
- Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752
- Azizi, S., et al. (2023). Synthetic Data from Diffusion Models Improves ImageNet Classification. TMLR 2023. arXiv:2304.08466

