How to Build the Best Dataset for Code Generation AI Like Claude Code, Cursor, and Devin
Written by
Aerin Kim

Learn how to build a high-quality dataset for code generation AI models like Claude Code, Cursor, and Devin with a complete 12-step pipeline covering filtering, deduplication, synthetic data, and evaluation.
The performance gap between a mediocre code generation model and one that reliably produces production-ready code almost always traces back to training data. Architecture innovations and compute scaling matter, but the research consistently shows that data quality is the dominant factor in final model capability. The LIMA paper from Meta AI demonstrated that a 65B parameter model fine-tuned on just 1,000 carefully curated examples could compete with models trained on 52,000 examples or more, strongly supporting what the authors called the Superficial Alignment Hypothesis: almost all knowledge is learned during pretraining, and alignment is primarily about teaching the model the right output format and style.
For code generation specifically, this finding has profound implications. Tools like Claude Code, Cursor, and Devin achieve their performance not through exotic architectures but through meticulous dataset engineering. If you are building, fine-tuning, or evaluating a code generation model in 2026, this guide walks through the complete technical pipeline for constructing a high-quality code dataset, from raw source extraction through deduplication, quality filtering, synthetic augmentation, and evaluation. We include concrete implementation details, code snippets, filtering heuristics, and references to the research that underpins each technique.
Why Data Quality Outperforms Data Quantity for Code Models
The relationship between dataset size and model performance is not linear, particularly for alignment and fine-tuning. The LIMA experiments showed that when training a 7B parameter LLaMa model on exponentially increasing amounts of quality-filtered Stack Exchange data (from 2K to 32K examples), performance as measured by ChatGPT evaluation plateaued despite a 16-fold increase in data size. The researchers found that scaling prompt diversity and output quality had measurable positive effects, while scaling quantity alone did not.
This finding aligns with what the BigCode project observed when training StarCoder. Filtering The Stack dataset for quality, deduplicating aggressively, and removing low-signal files produced significantly better benchmark scores than training on the raw corpus. The CCNet pipeline research similarly showed that duplicated training examples are pervasive in common NLP datasets and that deduplication is critical for reducing training time and eliminating bias toward frequently repeated patterns.
For code data, these effects are amplified because code has formal structure. A natural language model can tolerate noise because human language is flexible. Code is different. A single misplaced bracket, an incorrect import statement, or a deprecated API call makes the difference between a function that runs and one that throws an error. When training data contains broken, outdated, or poorly written code, the model learns to reproduce those patterns with high fidelity.
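Because code correctness is mechanical, some of this damage can be filtered out mechanically too. As a minimal sketch (not a step from any particular published pipeline), syntactically invalid Python files can be dropped with the standard library's `ast` module before any deeper quality analysis:

```python
import ast

def is_valid_python(content: str) -> bool:
    """Return True if the source parses as syntactically valid Python."""
    try:
        ast.parse(content)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers e.g. sources containing null bytes
        return False

# A file with a misplaced bracket is rejected before it can teach
# the model to reproduce broken code.
```

This catches only syntax-level breakage, not deprecated APIs or logic errors, but it is nearly free to run at corpus scale.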
The AWS machine learning blog on dataset preparation for LLM training emphasizes that "the performance of these models is heavily influenced by the data used during the training process" and that "having a well-curated dataset is crucial for achieving optimal performance." This applies doubly to code, where correctness is binary rather than subjective.
Step 1: Source Extraction and Text Preprocessing
The first stage of any code dataset pipeline mirrors the general LLM data preparation process: extracting clean text from diverse source formats. Code data comes from GitHub repositories, package registries, documentation sites, Q&A platforms, and research papers. Each source format requires specific extraction tooling.
Extracting Code from GitHub Repositories
For GitHub-sourced data, the primary challenge is efficiently cloning and processing millions of repositories while respecting rate limits and storage constraints. The BigCode project provides reference tooling for this through their dataset pipeline. A typical extraction workflow involves cloning repositories using the GitHub API, walking the file tree, and extracting file contents along with metadata.
```python
import os
import json
from pathlib import Path

def extract_repo_files(repo_path, max_file_size_bytes=100_000):
    """Extract code files from a cloned repository with metadata."""
    results = []
    for filepath in Path(repo_path).rglob("*"):
        if not filepath.is_file():
            continue
        # Skip binary and non-code files
        if filepath.suffix in {'.png', '.jpg', '.gif', '.ico', '.woff',
                               '.ttf', '.lock', '.pyc', '.so', '.dll'}:
            continue
        file_size = filepath.stat().st_size
        # Filter by size: skip files under 100 bytes or over 100KB
        if file_size < 100 or file_size > max_file_size_bytes:
            continue
        try:
            content = filepath.read_text(encoding='utf-8', errors='strict')
        except (UnicodeDecodeError, PermissionError):
            continue
        results.append({
            'filepath': str(filepath.relative_to(repo_path)),
            'language': detect_language(filepath.suffix, content),
            'size_bytes': file_size,
            'content': content,
        })
    return results
```
This approach parallels the HTML and PDF extraction techniques where the trafilatura library is used to extract and preprocess HTML content from web pages. For code, the extraction is simpler in some ways (most source files are plain text) but more complex in others (understanding file relationships within a repository requires structural analysis).
Extracting Documentation and Q&A Data

Stack Overflow and Stack Exchange data provide valuable natural-language-to-code mappings. The LIMA paper used 200 questions and answers from STEM Stack Exchange communities and 200 from other exchanges, applying temperature-based sampling (τ = 3) to get a more uniform distribution across domains. Their filtering criteria for Stack Exchange data included selecting questions with the highest score that were self-contained in the title, requiring answers with a minimum score of 10, and filtering answers that were too short (under 1,200 characters) or too long (over 4,096 characters).
For code-specific Q&A data, a similar approach works well:
````python
def filter_stackoverflow_answer(answer, question):
    """Apply quality filters to Stack Overflow Q&A pairs."""
    # Minimum answer score threshold
    if answer['score'] < 10:
        return False

    # Must contain a code block
    if '<code>' not in answer['body'] and '```' not in answer['body']:
        return False

    # Filter very short answers (likely link-only)
    plain_text = strip_html(answer['body'])
    if len(plain_text) < 500:
        return False

    # Filter very long answers (likely copy-pasted documentation)
    if len(plain_text) > 8192:
        return False

    # Exclude answers that reference other answers
    reference_patterns = ['as mentioned', 'see above', 'other answer']
    if any(p in plain_text.lower() for p in reference_patterns):
        return False

    # Exclude first-person anecdotal answers
    first_person_ratio = count_first_person(plain_text) / max(word_count(plain_text), 1)
    if first_person_ratio > 0.05:
        return False

    return True
````
This filtering methodology automatically rejects answers that are too short, too long, written in the first person, or that reference other answers, and it strips links, images, and other HTML tags from the response, retaining only code blocks and lists.
Step 2: License Filtering and Legal Compliance
Every record in your code dataset must have a traceable, permissive license. The standard approach is to filter for repositories with licenses that explicitly permit use for model training. The StarCoder project established industry precedent by including opt-out mechanisms for developers and filtering The Stack dataset to only include permissively licensed code.
```python
PERMISSIVE_LICENSES = {
    'mit', 'apache-2.0', 'bsd-2-clause', 'bsd-3-clause',
    'cc0-1.0', 'unlicense', 'isc', 'artistic-2.0',
    'zlib', 'bsl-1.0', 'postgresql', 'ncsa',
}

def check_license(repo_metadata):
    """Check if repository has a permissive license."""
    license_key = repo_metadata.get('license', {}).get('spdx_id', '')
    if not license_key:
        return False  # No license detected, exclude
    return license_key.lower() in PERMISSIVE_LICENSES
```
Repositories without a license file should be excluded entirely. This is non-negotiable for any serious code dataset project, both for legal protection and for maintaining trust with the open-source community.
Step 3: Repository-Level Quality Filtering
Raw GitHub data contains enormous amounts of low-signal code. Student homework, abandoned projects, auto-generated boilerplate, and configuration dumps all add noise. Repository-level filtering uses metadata signals to identify high-quality sources before processing individual files.
The key quality signals at the repository level include star count, commit frequency, recency of activity, presence of test files, documentation quality, and CI/CD configuration. Each signal is imperfect individually, but their combination provides a strong quality indicator.
```python
from datetime import datetime, timedelta
import math

def score_repository(repo_metadata):
    """Compute a composite quality score for a repository."""
    score = 0.0

    # Star count (log-scaled to reduce outlier influence)
    stars = repo_metadata.get('stargazers_count', 0)
    if stars >= 5:
        score += min(math.log2(stars), 15)  # Cap at ~32K stars
    else:
        return 0  # Hard filter: minimum 5 stars

    # Commit recency
    last_push = datetime.fromisoformat(
        repo_metadata['pushed_at'].replace('Z', '+00:00')
    )
    days_since_push = (datetime.now(last_push.tzinfo) - last_push).days
    if days_since_push < 365:
        score += 5
    elif days_since_push < 730:
        score += 2

    # Has tests
    if repo_metadata.get('has_test_directory', False):
        score += 4

    # Has CI/CD
    if repo_metadata.get('has_ci_config', False):
        score += 3

    # Has README with substantial content
    readme_length = repo_metadata.get('readme_length', 0)
    if readme_length > 500:
        score += 2

    # Fork penalty (forks are often duplicates)
    if repo_metadata.get('fork', False):
        score *= 0.3

    return score
```
The LIMA paper demonstrated a related principle: when comparing models trained on diverse Stack Exchange data versus homogeneous wikiHow data, the diverse source yielded significantly higher performance (3.83 vs. 3.49 on their quality metric). Applying this to code datasets means prioritizing repositories that cover diverse programming patterns, domains, and complexity levels rather than over-sampling from a narrow set of popular frameworks.
Step 4: File-Level Filtering and Preprocessing

After repository-level filtering, the next stage processes individual files. This involves removing non-code files, filtering by size, detecting languages, and removing auto-generated content.
Removing Auto-Generated and Low-Signal Files
Auto-generated files are particularly damaging to code datasets because they teach the model to reproduce boilerplate patterns rather than learn generalizable programming concepts.
```python
import os
from pathlib import Path

AUTO_GENERATED_MARKERS = [
    'do not edit', 'auto-generated', 'generated by',
    'this file was generated', 'autogenerated',
    'code generated by', 'generated from',
    '// @generated', '# -*- coding: utf-8 -*-\n# Generated by',
]

SKIP_FILENAMES = {
    'package-lock.json', 'yarn.lock', 'Cargo.lock', 'poetry.lock',
    'go.sum', 'Pipfile.lock', 'composer.lock', 'Gemfile.lock',
    '.DS_Store', 'Thumbs.db',
}

SKIP_DIRECTORIES = {
    'node_modules', 'vendor', 'venv', '.venv', '__pycache__',
    'dist', 'build', '.git', '.svn', 'target', 'bin', 'obj',
}

def should_include_file(filepath, content):
    """Determine if a file should be included in the dataset."""
    filename = os.path.basename(filepath)

    # Skip known low-value files
    if filename in SKIP_FILENAMES:
        return False

    # Skip files in vendored/generated directories
    path_parts = set(Path(filepath).parts)
    if path_parts & SKIP_DIRECTORIES:
        return False

    # Check first 500 characters for auto-generation markers
    header = content[:500].lower()
    if any(marker in header for marker in AUTO_GENERATED_MARKERS):
        return False

    # Skip minified files (very long lines indicate minification)
    lines = content.split('\n')
    if lines and max(len(line) for line in lines[:10]) > 500:
        avg_line_length = sum(len(l) for l in lines) / max(len(lines), 1)
        if avg_line_length > 200:
            return False

    return True
```
Language Detection
GitHub's Linguist library provides language classification based on file extensions and heuristics, but a secondary pass using content analysis catches misclassified files:
```python
EXTENSION_TO_LANGUAGE = {
    '.py': 'python', '.js': 'javascript', '.ts': 'typescript',
    '.java': 'java', '.cpp': 'cpp', '.cc': 'cpp', '.c': 'c',
    '.go': 'go', '.rs': 'rust', '.rb': 'ruby', '.php': 'php',
    '.swift': 'swift', '.kt': 'kotlin', '.scala': 'scala',
    '.r': 'r', '.R': 'r', '.sql': 'sql', '.sh': 'bash',
    '.lua': 'lua', '.dart': 'dart', '.jl': 'julia',
}

SHEBANG_TO_LANGUAGE = {
    'python': 'python', 'node': 'javascript', 'ruby': 'ruby',
    'bash': 'bash', 'sh': 'bash', 'perl': 'perl', 'php': 'php',
}

def detect_language(extension, content):
    """Detect programming language from extension and content."""
    # Primary: file extension
    lang = EXTENSION_TO_LANGUAGE.get(extension.lower())
    if lang:
        return lang

    # Secondary: shebang line
    first_line = content.split('\n')[0].strip()
    if first_line.startswith('#!'):
        for key, language in SHEBANG_TO_LANGUAGE.items():
            if key in first_line:
                return language

    return 'unknown'
```
Step 5: Deduplication at Scale
Deduplication is one of the highest-impact steps in the entire pipeline. The CCNet paper describes a systematic approach where 30 TB of data was divided into 1,600 shards of approximately 5 GB each, with hash-based deduplication applied at the paragraph level. For code datasets, the same principles apply but with code-specific considerations.
Exact Deduplication
The first pass removes files with identical content using SHA-256 hashing. This is computationally cheap and typically removes 10 to 30 percent of raw data.
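A minimal sketch of this exact-dedup pass, keeping the first occurrence of each content digest (the record shape follows the extraction step earlier; the function name is ours):

```python
import hashlib

def exact_dedup(file_records):
    """Drop files whose content is byte-for-byte identical,
    keeping the first occurrence of each SHA-256 digest."""
    seen = set()
    unique = []
    for record in file_records:
        digest = hashlib.sha256(record['content'].encode('utf-8')).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Because only a 32-byte digest is kept per file, this scales to billions of files with modest memory, and it runs in a single streaming pass.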
Near-Duplicate Detection with MinHash LSH
Near-duplicate detection is critical because forking, copy-pasting, and template usage create massive amounts of nearly-identical code on GitHub. The MinHash with Locality-Sensitive Hashing (LSH) approach provides efficient approximate similarity detection at scale.
MinHash is a popular method for estimating the similarity between two documents. It is particularly useful for large datasets because it provides an efficient approximation of the Jaccard similarity. The technique works by breaking text into shingles (overlapping sequences of tokens), applying multiple hash functions to each shingle, and using the minimum hash values as a compact signature for similarity comparison.
```python
from datasketch import MinHash, MinHashLSH

def build_minhash(content, num_perm=128, ngram_size=5):
    """Build a MinHash signature for a code file."""
    mh = MinHash(num_perm=num_perm)
    # Normalize whitespace for code comparison
    normalized = normalize_code_whitespace(content)
    # Create token-level n-grams (shingles)
    tokens = normalized.split()
    for i in range(len(tokens) - ngram_size + 1):
        shingle = ' '.join(tokens[i:i + ngram_size])
        mh.update(shingle.encode('utf-8'))
    return mh

def near_dedup_minhash(file_records, threshold=0.8, num_perm=128):
    """Remove near-duplicate files using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique_records = []
    duplicates_removed = 0

    for idx, record in enumerate(file_records):
        mh = build_minhash(record['content'], num_perm=num_perm)
        # Query for similar documents already in the index
        result = lsh.query(mh)
        if not result:
            # No near-duplicates found, add to index
            lsh.insert(f"doc_{idx}", mh)
            unique_records.append(record)
        else:
            duplicates_removed += 1

    print(f"Removed {duplicates_removed} near-duplicates "
          f"({duplicates_removed / len(file_records) * 100:.1f}%)")
    return unique_records

def normalize_code_whitespace(content):
    """Normalize whitespace and comments for dedup comparison."""
    lines = content.split('\n')
    normalized_lines = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith('#') \
                and not stripped.startswith('//'):
            normalized_lines.append(stripped)
    return '\n'.join(normalized_lines)
```
The CCNet pipeline computes "the first 64 bits of SHA-1 digits of the normalized paragraphs as the key" for paragraph-level deduplication. For code, normalizing whitespace and optionally removing comments before hashing provides better near-duplicate detection, since developers frequently copy code and only change variable names or formatting.
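That CCNet-style key is straightforward to compute with the standard library. A hedged sketch (the helper name and the whitespace-only normalization are ours; CCNet's own normalization is more involved):

```python
import hashlib

def dedup_key(paragraph: str) -> bytes:
    """CCNet-style dedup key: the first 64 bits (8 bytes) of the
    SHA-1 digest of the normalized text. Here normalization just
    lowercases and collapses whitespace, as a simplification."""
    normalized = ' '.join(paragraph.split()).lower()
    return hashlib.sha1(normalized.encode('utf-8')).digest()[:8]
```

Truncating to 64 bits keeps the key set small enough to hold in memory across shards while making accidental collisions vanishingly rare at corpus scale.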
LSH complements MinHash by bucketing signatures so that only likely matches are compared, avoiding a pairwise comparison of every document in the dataset.
The Jaccard similarity threshold should be set between 0.7 and 0.85 depending on how aggressive you want the deduplication to be. Lower thresholds remove more files but risk removing legitimately different implementations of similar functionality. A threshold of 0.8 is a reasonable starting point for most code datasets.

Step 6: Code Quality Scoring
Beyond binary filtering, assigning a continuous quality score to each file allows you to weight high-quality examples more heavily during training. This is the code equivalent of what the LIMA paper found regarding quality filtering: "there is a significant 0.5 point difference between models trained on the filtered and unfiltered data sources."
Static Analysis-Based Scoring
Language-specific linting tools provide automated quality signals. For Python, tools like Ruff or Pylint can be run programmatically:
```python
import subprocess
import json

def compute_python_quality_score(filepath):
    """Compute a quality score for a Python file using ruff."""
    score = 1.0  # Start at maximum

    # Run ruff for linting issues
    result = subprocess.run(
        ['ruff', 'check', '--output-format', 'json', filepath],
        capture_output=True, text=True
    )
    try:
        issues = json.loads(result.stdout)
    except json.JSONDecodeError:
        issues = []

    # Penalize based on issue count relative to file size
    with open(filepath, 'r') as f:
        line_count = sum(1 for _ in f)

    if line_count > 0:
        issue_density = len(issues) / line_count
        score -= min(issue_density * 5, 0.5)  # Cap penalty at 0.5

    return max(score, 0.0)
```
Heuristic Quality Signals
Beyond linting, several heuristic signals correlate with code quality. The FineWeb-Edu approach of using lightweight classifiers to score educational value has been adapted for code quality assessment in several recent projects:
```python
import re
from typing import Dict

def compute_heuristic_quality_score(content: str, language: str) -> Dict:
    """Compute heuristic quality signals for a code file."""
    lines = content.split('\n')
    non_empty_lines = [l for l in lines if l.strip()]
    scores = {}

    # 1. Docstring/comment density
    comment_lines = count_comment_lines(content, language)
    scores['comment_density'] = min(
        comment_lines / max(len(non_empty_lines), 1), 0.4
    )  # Cap at 40% to avoid over-commented files

    # 2. Average identifier length (longer names = more descriptive)
    identifiers = extract_identifiers(content, language)
    if identifiers:
        avg_id_length = sum(len(i) for i in identifiers) / len(identifiers)
        scores['identifier_quality'] = min(avg_id_length / 12, 1.0)
    else:
        scores['identifier_quality'] = 0.5

    # 3. Single-character variable ratio (lower is better)
    if identifiers:
        single_char = sum(1 for i in identifiers if len(i) == 1)
        single_char_ratio = single_char / len(identifiers)
        scores['naming_quality'] = 1.0 - single_char_ratio
    else:
        scores['naming_quality'] = 0.5

    # 4. Type annotation coverage (for Python, TypeScript)
    if language in ('python', 'typescript'):
        type_annotations = len(re.findall(
            r':\s*(str|int|float|bool|List|Dict|Optional|Union|Any)',
            content
        ))
        function_defs = len(re.findall(r'def\s+\w+', content))
        if function_defs > 0:
            scores['type_coverage'] = min(
                type_annotations / (function_defs * 2), 1.0
            )
        else:
            scores['type_coverage'] = 0.5

    # 5. Cyclomatic complexity proxy (fewer nested blocks = simpler)
    max_indent = 0
    for line in lines:
        if line.strip():
            indent = len(line) - len(line.lstrip())
            max_indent = max(max_indent, indent)
    scores['complexity'] = max(1.0 - (max_indent / 40), 0.0)

    # Composite score
    weights = {
        'comment_density': 0.2,
        'identifier_quality': 0.15,
        'naming_quality': 0.25,
        'type_coverage': 0.15,
        'complexity': 0.25,
    }
    composite = sum(
        scores.get(k, 0.5) * v for k, v in weights.items()
    )
    scores['composite'] = composite
    return scores
```
The quality filtering for Stack Exchange data used similar multi-signal approaches: minimum answer scores, character length bounds, first-person language detection, and reference filtering. The principle is the same for code: combine multiple imperfect signals into a composite score that is significantly more reliable than any individual signal.
You do not need to exclude all lower-scoring code from your dataset. The LIMA ablation experiments showed that the relationship between quality and performance is continuous rather than binary. A mixture of quality levels can help the model learn to distinguish good code from bad code, particularly for tasks like code review and refactoring. The key is ensuring that high-quality code is overrepresented in your training distribution.
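One simple way to realize that overrepresentation is to sample training examples with probability proportional to their composite quality score raised to a power. A minimal sketch (the exponent `gamma` is a tunable assumption on our part, not a published value):

```python
import random

def quality_weighted_sample(records, k, gamma=2.0, seed=0):
    """Sample k records with replacement, weighting each record by its
    composite quality score raised to the power gamma, so high-quality
    code is overrepresented without excluding lower-scoring examples."""
    rng = random.Random(seed)
    weights = [
        max(r.get('quality_scores', {}).get('composite', 0.5), 1e-6) ** gamma
        for r in records
    ]
    return rng.choices(records, weights=weights, k=k)
```

With `gamma=2.0`, a file scoring 0.9 is sampled roughly 80 times more often than one scoring 0.1, yet nothing is excluded outright.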
Step 7: Context Window Optimization and FIM Formatting
Modern code generation models operate on context windows ranging from 8K to 200K tokens. How you structure training examples to leverage these windows has a significant impact on the model's real-world capability.
Repository-Level Context Construction
Rather than training on isolated files, construct examples that include related files from the same repository. This teaches the model to reason about code in context, which is fundamental to how tools like Cursor and Devin operate in practice:
```python
def construct_repo_context_example(repo_files, target_file, max_tokens=8192):
    """Build a training example with repository context."""
    context_parts = []
    token_budget = max_tokens

    # Priority 1: README (project understanding)
    readme = find_file(repo_files, 'README.md')
    if readme:
        truncated = truncate_to_tokens(readme['content'], 1024)
        context_parts.append(f"# README\n{truncated}")
        token_budget -= count_tokens(truncated)

    # Priority 2: Corresponding test file
    test_file = find_test_file(repo_files, target_file['filepath'])
    if test_file and token_budget > 512:
        truncated = truncate_to_tokens(test_file['content'], 2048)
        context_parts.append(
            f"# Test: {test_file['filepath']}\n{truncated}"
        )
        token_budget -= count_tokens(truncated)

    # Priority 3: Import dependencies (files imported by target)
    imports = extract_imports(target_file['content'])
    for imp in imports:
        imp_file = resolve_import(repo_files, imp)
        if imp_file and token_budget > 256:
            truncated = truncate_to_tokens(imp_file['content'], 1024)
            context_parts.append(
                f"# Dependency: {imp_file['filepath']}\n{truncated}"
            )
            token_budget -= count_tokens(truncated)

    # Priority 4: Configuration files
    for config_name in ['pyproject.toml', 'package.json', 'Cargo.toml']:
        config = find_file(repo_files, config_name)
        if config and token_budget > 128:
            truncated = truncate_to_tokens(config['content'], 512)
            context_parts.append(
                f"# Config: {config_name}\n{truncated}"
            )
            token_budget -= count_tokens(truncated)

    # Target file
    context_parts.append(
        f"# Target: {target_file['filepath']}\n{target_file['content']}"
    )
    return '\n\n'.join(context_parts)
```
Fill-in-the-Middle (FIM) Training Format
The FIM objective is critical for code completion use cases. Instead of always training left-to-right, FIM splits code into a prefix, middle, and suffix, then trains the model to predict the middle given the surrounding context. The two standard formats are PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle):
```python
import random

FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"

def apply_fim_transform(content, fim_rate=0.5):
    """Apply Fill-in-the-Middle transformation to a code example."""
    if random.random() > fim_rate:
        return content  # Standard left-to-right format

    lines = content.split('\n')
    if len(lines) < 3:
        return content

    # Choose a random split point for the middle section,
    # preferring function/class boundaries
    split_candidates = []
    for i, line in enumerate(lines):
        if 0 < i < len(lines) - 1:
            stripped = line.strip()
            if stripped.startswith('def ') or stripped.startswith('class ') \
                    or stripped.startswith('function ') or stripped == '':
                split_candidates.append(i)
    if not split_candidates:
        split_candidates = list(range(1, len(lines) - 1))

    # Select the middle section (up to ~one third of the file)
    mid_start = random.choice(split_candidates)
    mid_length = random.randint(1, max(1, len(lines) // 3))
    mid_end = min(mid_start + mid_length, len(lines) - 1)

    prefix = '\n'.join(lines[:mid_start])
    middle = '\n'.join(lines[mid_start:mid_end])
    suffix = '\n'.join(lines[mid_end:])

    # PSM format (most common)
    if random.random() < 0.5:
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
    # SPM format
    return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}{middle}"
```
Most modern code generation models, including StarCoder 2 and Code Llama, use FIM on approximately 50 percent of training examples, with the remainder using standard left-to-right training. This balance ensures the model learns both generation and completion capabilities.

Step 8: Synthetic Data Generation and Validation
Synthetic data has become one of the most powerful tools for filling gaps in organic code datasets. The Self-Instruct technique provides a framework for using existing models to generate training examples. For code, the approach needs to be adapted to ensure functional correctness.
Instruction-Response Pair Generation
Use a strong existing model to generate natural language descriptions of code functions, then pair them as instruction-completion examples:
```python
INSTRUCTION_GEN_PROMPT = """Given the following Python function, generate a clear, specific natural language instruction that a developer might give to an AI assistant to produce this function. The instruction should describe WHAT the function does, not HOW it does it.

Function:
{code}

Generate only the instruction, nothing else."""

def generate_instruction_pairs(code_examples, generator_model):
    """Generate instruction-code pairs from existing code."""
    pairs = []
    for code in code_examples:
        instruction = generator_model.generate(
            INSTRUCTION_GEN_PROMPT.format(code=code)
        )
        pairs.append({
            'instruction': instruction.strip(),
            'response': code,
            'source': 'synthetic_instruction_gen',
        })
    return pairs
```
Test Case Generation for Validation
The most reliable quality signal for synthetic code is execution-based validation. Generate test cases alongside functions and run them:
```python
import tempfile
import subprocess
import os

TEST_GEN_PROMPT = """Write comprehensive pytest test cases for the following Python function. Include edge cases, error handling, and typical usage patterns.

Function:
{code}

Generate only the test code, including necessary imports."""

def validate_synthetic_code(code, test_code, timeout_seconds=30):
    """Validate synthetic code by running generated tests."""
    combined = f"{code}\n\n{test_code}"

    # Write to a temporary file
    with tempfile.NamedTemporaryFile(
        mode='w', suffix='.py', delete=False
    ) as f:
        f.write(combined)
        temp_path = f.name

    try:
        result = subprocess.run(
            ['python', '-m', 'pytest', temp_path, '-v', '--tb=short'],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
        passed = result.returncode == 0
        return {
            'passed': passed,
            'stdout': result.stdout,
            'stderr': result.stderr,
            'return_code': result.returncode,
        }
    except subprocess.TimeoutExpired:
        return {'passed': False, 'error': 'timeout'}
    finally:
        os.unlink(temp_path)
```
The LIMA paper included 50 examples from Super-Natural Instructions to add diversity to their training mix, even though "the distribution of potential user prompts is arguably different from the distribution of tasks in Super-Natural Instructions." Their intuition was that "this small sample adds diversity to the overall mix of training examples, and can potentially increase model robustness." The same principle applies to synthetic code data: even a small proportion (5 to 15 percent of total training data) of validated synthetic examples can meaningfully improve coverage of underrepresented patterns.
The critical caveat, emphasized by both the LIMA and AWS research, is quality control. Unverified synthetic data introduces subtle errors that compound during training. Every synthetic example should pass execution-based validation or secondary model-based scoring before inclusion.
Step 9: Data Mixing and Curriculum Design
The final pre-training step is defining how your processed data subsets are mixed and ordered. The LIMA paper provides direct evidence for the importance of this step: their 1,000 training examples were carefully balanced across Stack Exchange STEM (200), Stack Exchange Other (200), wikiHow (200), Reddit WritingPrompts (150), Natural Instructions (50), and manually authored examples (200).
Language Distribution
For code datasets, the programming language distribution significantly impacts performance. Training with too much Python relative to other languages improves Python benchmarks at the expense of everything else. A temperature-based sampling strategy, similar to what was used in the Llama 3 pre-training data mix, upsamples lower-resource languages:
```python
import math

def compute_sampling_weights(language_counts, temperature=0.7):
    """Compute sampling weights with temperature-based smoothing."""
    total = sum(language_counts.values())

    # Raw proportions
    proportions = {
        lang: count / total for lang, count in language_counts.items()
    }

    # Apply temperature scaling
    # temperature < 1.0 flattens the distribution (upsamples rare languages)
    # temperature > 1.0 sharpens it (concentrates on common languages)
    scaled = {
        lang: math.pow(prop, temperature)
        for lang, prop in proportions.items()
    }

    # Normalize
    scale_total = sum(scaled.values())
    weights = {
        lang: val / scale_total for lang, val in scaled.items()
    }
    return weights

# Example usage
language_counts = {
    'python': 5_000_000,
    'javascript': 4_000_000,
    'typescript': 2_000_000,
    'java': 3_000_000,
    'cpp': 1_500_000,
    'go': 800_000,
    'rust': 400_000,
    'ruby': 300_000,
    'scala': 100_000,
    'julia': 50_000,
}

weights = compute_sampling_weights(language_counts, temperature=0.7)
# Result: Rust and Julia get upsampled, Python gets downsampled
```
Curriculum Ordering
Research suggests that presenting cleaner, simpler examples first and gradually introducing complexity can improve convergence. For code, a curriculum might progress through three phases: Phase 1 consisting of well-documented single-function examples with tests, Phase 2 involving multi-file repository-level examples with import chains, and Phase 3 covering challenging edge cases, ambiguous specifications, and debugging trajectories.
```python
def assign_curriculum_phase(record):
    """Assign a curriculum phase based on example complexity."""
    complexity_score = record.get('quality_scores', {}).get('complexity', 0.5)
    has_tests = record.get('has_corresponding_test', False)
    file_count = record.get('context_file_count', 1)

    if complexity_score < 0.3 and has_tests and file_count == 1:
        return 1  # Simple, well-tested, single-file
    elif file_count > 1 and complexity_score < 0.7:
        return 2  # Multi-file, moderate complexity
    else:
        return 3  # Complex or edge-case
```
Step 10: Building Instruction-Tuning and Preference Datasets for Code
Beyond the pre-training corpus, code generation models benefit from instruction-tuning and preference-tuning datasets. The Stanford Alpaca project demonstrated how the Self-Instruct technique can be used to generate instruction-tuning datasets at scale.
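The core Self-Instruct loop can be sketched as follows. This is a simplified illustration, not the Alpaca implementation: `generate` is a placeholder for your own LLM client, and the novelty filter uses cheap token overlap rather than the ROUGE-L scoring used in the original paper.

```python
import random

def token_overlap(a, b):
    """Jaccard similarity over whitespace tokens (a cheap stand-in for ROUGE-L)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def self_instruct_step(task_pool, generate, num_prompts=8, max_similarity=0.7):
    """One Self-Instruct iteration: prompt the model with sampled seed tasks,
    then keep the generation only if it is sufficiently novel."""
    examples = random.sample(task_pool, min(num_prompts, len(task_pool)))
    prompt = "Come up with a new coding task:\n" + "\n".join(
        f"Task: {t}" for t in examples
    )
    candidate = generate(prompt)  # placeholder for an LLM API call
    # Reject near-duplicates of existing tasks to keep the pool diverse
    if all(token_overlap(candidate, t) < max_similarity for t in task_pool):
        task_pool.append(candidate)
        return candidate
    return None
```

Each accepted task grows the pool, so later iterations draw on a progressively more diverse set of seed examples.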
Instruction-Tuning Format for Code
Code-specific instruction datasets include columns for instruction type, the instruction itself, input context, and expected output:
```json
{
  "type": "code_generation",
  "instruction": "Write a Python function that implements binary search on a sorted list.",
  "input": "The function should return the index of the target element, or -1 if not found.",
  "output": "def binary_search(arr, target):\n    left, right = 0, len(arr) - 1\n    while left <= right:\n        mid = (left + right) // 2\n        if arr[mid] == target:\n            return mid\n        elif arr[mid] < target:\n            left = mid + 1\n        else:\n            right = mid - 1\n    return -1"
}
```
Preference-Tuning (DPO) Format for Code
DPO-style training requires an input or prompt, a chosen (preferred) response, and a rejected (less preferred) response. The code-specific version adds functional correctness as a key differentiator:
```json
{
  "prompt": "Write a function to find all prime numbers up to n using the Sieve of Eratosthenes.",
  "chosen": {
    "content": "def sieve_of_eratosthenes(n):\n    if n < 2:\n        return []\n    is_prime = [True] * (n + 1)\n    is_prime[0] = is_prime[1] = False\n    for i in range(2, int(n**0.5) + 1):\n        if is_prime[i]:\n            for j in range(i*i, n + 1, i):\n                is_prime[j] = False\n    return [i for i in range(2, n + 1) if is_prime[i]]",
    "role": "assistant"
  },
  "rejected": {
    "content": "def find_primes(n):\n    primes = []\n    for num in range(2, n+1):\n        for i in range(2, num):\n            if num % i == 0:\n                break\n        else:\n            primes.append(num)\n    return primes",
    "role": "assistant"
  },
  "preference_reason": "The chosen response uses the efficient O(n log log n) Sieve of Eratosthenes algorithm with proper edge case handling, while the rejected response uses a naive O(n * sqrt(n)) trial division approach."
}
```
A cohort-based labeling approach asks two or more LLMs to generate a label for the same data and accepts the label only when the models agree. This technique is particularly valuable for code preference datasets, where you can have multiple models generate solutions and then validate agreement on which solution is better through execution-based testing.
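A minimal sketch of that agreement check, assuming each labeler is a callable that returns a preference label ('A' or 'B') for a pair of candidate solutions:

```python
from collections import Counter

def cohort_label(prompt, solution_a, solution_b, labelers, min_agreement=1.0):
    """Ask each labeler model which solution is better; accept the label
    only if the agreement fraction meets the threshold."""
    votes = [labeler(prompt, solution_a, solution_b) for labeler in labelers]
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label  # consensus reached
    return None  # disagreement: route to human review or discard
```

With `min_agreement=1.0` (unanimity) you trade coverage for label precision; relaxing the threshold recovers more data at the cost of noisier preferences.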
Step 11: Execution-Based Validation at Scale

Running code in sandboxed environments provides the strongest quality signal available for code datasets. This is the same execution-based check used to validate synthetic data, and it requires containerized execution environments for safety.
```python
import docker
import os
import tempfile


class CodeExecutionValidator:
    """Validate code examples by executing them in Docker containers."""

    LANGUAGE_IMAGES = {
        'python': 'python:3.11-slim',
        'javascript': 'node:20-slim',
        'go': 'golang:1.22-alpine',
        'rust': 'rust:1.77-slim',
        'java': 'openjdk:21-slim',
    }

    def __init__(self):
        self.client = docker.from_env()

    def validate(self, code, language, timeout=30):
        """Execute code in a sandboxed container and check for errors."""
        image = self.LANGUAGE_IMAGES.get(language)
        if not image:
            return {'status': 'unsupported_language'}

        with tempfile.TemporaryDirectory() as tmpdir:
            # Write code to file
            ext = self._get_extension(language)
            code_path = os.path.join(tmpdir, f'solution{ext}')
            with open(code_path, 'w') as f:
                f.write(code)

            # Run in a locked-down container; wrap the command in `sh -c`
            # so compound compile-and-run commands (e.g. for Rust) work.
            cmd = ['sh', '-c', self._get_run_command(language, f'/code/solution{ext}')]
            container = None
            try:
                # docker-py has no `timeout` kwarg on containers.run, so we
                # detach and enforce the wall-clock limit via Container.wait.
                container = self.client.containers.run(
                    image,
                    command=cmd,
                    volumes={tmpdir: {'bind': '/code', 'mode': 'ro'}},
                    mem_limit='256m',
                    cpu_period=100_000,
                    cpu_quota=50_000,
                    network_disabled=True,
                    detach=True,
                )
                result = container.wait(timeout=timeout)
                logs = container.logs().decode('utf-8', errors='replace')
                if result['StatusCode'] != 0:
                    return {'status': 'runtime_error', 'error': logs}
                return {'status': 'success', 'output': logs}
            except Exception as e:
                return {'status': 'execution_error', 'error': str(e)}
            finally:
                if container is not None:
                    container.remove(force=True)

    def _get_extension(self, language):
        return {
            'python': '.py', 'javascript': '.js', 'go': '.go',
            'rust': '.rs', 'java': '.java',
        }.get(language, '.txt')

    def _get_run_command(self, language, filepath):
        return {
            'python': f'python {filepath}',
            'javascript': f'node {filepath}',
            'go': f'go run {filepath}',
            'rust': f'rustc {filepath} -o /tmp/out && /tmp/out',
            'java': f'java {filepath}',
        }.get(language, f'cat {filepath}')
```
This validation is computationally expensive, so a practical compromise is to apply it to your highest-value data subsets, such as synthetic examples, competitive programming solutions, and function-test pairs, while relying on static analysis for the broader corpus.
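For the broader corpus, a much cheaper static check catches syntax-level breakage without any container overhead. This sketch uses Python's built-in `ast` module; other languages would need their own parsers or linters.

```python
import ast

def static_check_python(code):
    """Cheap static validation: does the code parse, and does it
    contain at least one function or class definition?"""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return {'status': 'syntax_error', 'error': str(e)}
    has_definition = any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        for node in ast.walk(tree)
    )
    return {'status': 'ok', 'has_definition': has_definition}
```

Because this runs in microseconds per file, it can be applied to every example in the corpus, reserving containerized execution for the high-value subsets.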
Step 12: Continuous Evaluation and Iteration
Building a code dataset is not a one-time task. The LIMA authors emphasized evaluation-driven iteration: they used both human evaluation (with inter-annotator agreement scores of 78-82%) and automated GPT-4 evaluation to measure model quality, and iteratively refined their data based on the findings.
Your evaluation suite should include established benchmarks such as HumanEval (164 Python programming problems), HumanEval+ (with expanded test cases from the EvalPlus project), MBPP (974 crowd-sourced Python programming problems), SWE-bench (real GitHub issues requiring repository-level reasoning), and MultiPL-E (HumanEval translated to 18 programming languages).
The LIMA paper also highlights an important finding about evaluation metrics: "perplexity does not correlate with generation quality." They observed that as perplexity rose with more training steps (typically a sign of overfitting), generation quality actually increased. This means you cannot rely solely on perplexity as your evaluation metric for code generation models. Pass@k rates on execution-based benchmarks and human evaluation of code quality provide much more reliable signals.
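The standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) computes, for each problem with n sampled completions of which c pass, the probability that at least one of k random draws passes:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so at least one draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with 5 of 10 samples passing yields pass@1 = 0.5, and the benchmark score is the mean of this estimator across all problems.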
How Claude Code, Cursor, and Devin Approach Their Training Data
While the exact training data compositions for these tools remain proprietary, their published research and documented capabilities reveal important patterns about their dataset engineering strategies.
Anthropic's Claude Code benefits from the dynamic the LIMA paper describes as the Superficial Alignment Hypothesis: a strong pre-training foundation combined with carefully curated alignment data. Anthropic has published extensively on Constitutional AI and RLHF, and their approach to code generation involves pairing large-scale code pre-training with instruction-tuning data that teaches the model to follow complex multi-step coding instructions. The quality of the human feedback data used in RLHF is at least as important as the quality of the pre-training code data.
Cursor has differentiated itself through deep IDE integration and repository-level context awareness. Their approach emphasizes training on data that mirrors actual developer editing patterns, including partial file edits, cursor-position-aware completions, and multi-file refactoring chains. This implies their dataset contains a significant proportion of structured edit sequences formatted with FIM-style objectives rather than isolated code files.
Devin from Cognition represents the agentic approach. Devin's ability to autonomously plan, execute, and debug code implies training on trajectory data: sequences of actions including code writing, terminal commands, browser interactions, and error resolution. This goes beyond standard code generation datasets and requires capturing the full software development workflow as training signal.
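A trajectory record for agentic training might look like the following. The field names and action set here are purely illustrative, not a published schema; the point is that each step pairs an action with the observation it produced, so the model learns the full plan-act-observe loop.

```python
# Hypothetical schema for one agent trajectory (illustrative field names)
trajectory = {
    "task": "Fix the failing test in tests/test_parser.py",
    "steps": [
        {"action": "run_command", "input": "pytest tests/test_parser.py",
         "observation": "1 failed: AssertionError on line 42"},
        {"action": "read_file", "input": "src/parser.py",
         "observation": "<file contents>"},
        {"action": "edit_file", "input": {"path": "src/parser.py", "patch": "<diff>"},
         "observation": "file updated"},
        {"action": "run_command", "input": "pytest tests/test_parser.py",
         "observation": "1 passed"},
    ],
    "outcome": "success",
}
```

Records like this are serialized (typically as JSONL) and interleaved with standard code examples so the model sees both isolated generation and full workflows.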
The common thread across all three is that the training data is structured to match the inference-time use case. This principle should guide your own dataset engineering decisions.
Common Mistakes That Degrade Code Dataset Quality
Based on the research literature and practical experience, several failure modes consistently undermine code dataset quality.
Skipping deduplication is the most common and most costly mistake. Duplicated training examples are pervasive in common NLP datasets, and for code the problem is even worse due to GitHub forking, template repositories, and vendored dependencies.
Ignoring license compliance creates legal risk. Every major code generation project that has faced public controversy has had licensing at the center. The StarCoder project set the standard by implementing transparent license filtering and opt-out mechanisms for The Stack dataset.
Including outdated or deprecated code without labeling teaches the model to suggest patterns that no longer work. APIs evolve, frameworks change, and best practices shift. Your dataset should reflect current practices.
Neglecting natural language context produces models that complete code but struggle with instructions. The LIMA paper showed that the 200 manually authored examples with carefully written natural language prompts were essential for teaching the model to understand user intent. The natural language component of a code dataset bridges the gap between predicting the next token and understanding what the developer is trying to build.
The LIMA paper also found that including just 13 safety-related training examples with carefully written rejection responses caused the model to "respond safely to 80% of potentially sensitive prompts." This suggests that even a small number of well-crafted safety examples can meaningfully improve a code model's ability to refuse harmful requests, such as generating malware or exploiting vulnerabilities.
Production Pipeline Architecture
Moving from a prototype to a production-grade pipeline requires engineering infrastructure that supports incremental updates, provenance tracking, and reproducible builds.
A production code dataset pipeline typically includes a data ingestion layer that handles GitHub API integration, documentation scraping, and Q&A platform exports. A processing layer handles extraction, filtering, deduplication, and quality scoring. A storage layer uses object storage for raw and processed data with Parquet or Arrow formats for efficient columnar access. An orchestration layer coordinates pipeline stages and handles retries and monitoring. A versioning layer tracks dataset versions alongside model training code.
```python
# Example pipeline orchestration pseudocode
pipeline = DatasetPipeline(
    stages=[
        GitHubIngestionStage(
            min_stars=5,
            languages=['python', 'javascript', 'typescript', 'java', 'go', 'rust'],
            license_filter=PERMISSIVE_LICENSES,
        ),
        FileExtractionStage(
            max_file_size=100_000,
            skip_patterns=SKIP_FILENAMES | SKIP_DIRECTORIES,
        ),
        ExactDeduplicationStage(hash_algorithm='sha256'),
        NearDeduplicationStage(
            method='minhash_lsh',
            threshold=0.8,
            num_perm=128,
        ),
        QualityScoringStage(
            linting=True,
            heuristic_scoring=True,
            min_composite_score=0.3,
        ),
        FIMFormattingStage(fim_rate=0.5),
        DataMixingStage(
            language_temperature=0.7,
            curriculum_phases=3,
        ),
        ValidationStage(
            execution_validation_rate=0.1,  # 10% of examples
            static_analysis_rate=1.0,       # All examples
        ),
    ],
    output_format='parquet',
    output_path='s3://dataset-bucket/code-gen/v1/',
)
pipeline.run()
```
Tools like DVC (Data Version Control) or LakeFS allow you to track dataset versions alongside model code, making it possible to reproduce any previous training run. The Hugging Face DataTrove library provides additional tooling for large-scale data processing pipelines.
Conclusion
Building a high-quality dataset for code generation AI is the single highest-leverage activity when developing a coding assistant. The research behind LIMA, CCNet, and StarCoder converges on the same conclusion: model capability is bounded by data quality, and careful curation of a smaller dataset consistently outperforms bulk collection of a larger one.
The pipeline described in this guide covers source extraction and preprocessing, license filtering, repository-level and file-level quality filtering, exact and near-duplicate removal using MinHash LSH, code quality scoring through static analysis and heuristics, context window optimization with FIM formatting, synthetic data generation with execution-based validation, instruction-tuning and preference dataset construction, data mixing with temperature-based language sampling, and continuous evaluation against established benchmarks.
Each step builds on the previous one, and skipping any step introduces compounding quality degradation. Whether you are fine-tuning an open-weight model for a specific domain or building a foundation model from scratch, investing in dataset engineering yields the highest return per hour of effort spent. Start with the techniques described here, iterate based on your evaluation results, and treat your dataset as a living product that requires ongoing maintenance and improvement.
References and Sources
[1] Zhou, C., Liu, P., Xu, P., et al. "LIMA: Less Is More for Alignment." arXiv preprint arXiv:2305.11206, 2023.
[2] Zamarin, S., Ping, D., Elango, V., et al. "An introduction to preparing your own dataset for LLM training." AWS Machine Learning Blog, December 2024.
[3] Wenzek, G., Lachaux, M.A., Conneau, A., et al. "CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data." arXiv preprint arXiv:1911.00359, 2019.
[4] Li, R., Allal, L.B., Zi, Y., et al. "StarCoder: May the source be with you!" arXiv preprint arXiv:2305.06161, 2023.
[5] Lozhkov, A., Li, R., Allal, L.B., et al. "StarCoder 2 and The Stack v2: The Next Generation." arXiv preprint arXiv:2402.19173, 2024.
[6] Rozière, B., Gehring, J., Gloeckle, F., et al. "Code Llama: Open Foundation Models for Code." arXiv preprint arXiv:2308.12950, 2023.
[7] Bavarian, M., Jun, H., Tezak, N., et al. "Efficient Training of Language Models to Fill in the Middle." arXiv preprint arXiv:2207.14255, 2022.
[8] Penedo, G., Kydlíček, H., Lozhkov, A., et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv preprint arXiv:2406.17557, 2024.
[9] Dubey, A., Jauhri, A., Pandey, A., et al. "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783, 2024.
[10] Touvron, H., Lavril, T., Izacard, G., et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971, 2023.
[11] Chen, M., Tworek, J., Jun, H., et al. "Evaluating Large Language Models Trained on Code." arXiv preprint arXiv:2107.03374, 2021.
[12] Liu, J., Xia, C.S., Wang, Y., Zhang, L. "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Code Generation with EvalPlus." arXiv preprint arXiv:2305.01210, 2023.
[13] Austin, J., Odena, A., Nye, M., et al. "Program Synthesis with Large Language Models." arXiv preprint arXiv:2108.07732, 2021.
[14] Jimenez, C.E., Yang, J., Wettig, A., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770, 2023.
[15] Cassano, F., Gouwar, J., Nguyen, D., et al. "MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation." arXiv preprint arXiv:2208.08227, 2022.
[16] Wang, Y., Kordi, Y., Mishra, S., et al. "Self-Instruct: Aligning Language Model with Self Generated Instructions." arXiv preprint arXiv:2212.10560, 2022.
[17] Taori, R., Gulrajani, I., Zhang, T., et al. "Stanford Alpaca: An Instruction-following LLaMA model." GitHub, 2023.
[18] Rafailov, R., Sharma, A., Mitchell, E., et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv preprint arXiv:2305.18290, 2023.
[19] Ouyang, L., Wu, J., Jiang, X., et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems, 2022.
[20] Bai, Y., Kadavath, S., Kundu, S., et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv preprint arXiv:2212.08073, 2022.
[21] Wang, Y., Mishra, S., Alipoormolabashi, P., et al. "Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks." EMNLP, 2022.
[22] OpenAI. "GPT-4 Technical Report." arXiv preprint arXiv:2303.08774, 2023.
[23] The BigCode Project. Open Scientific Collaboration for Open Code LLMs.
[24] The Stack Dataset. Hugging Face Datasets Hub.
[25] Hugging Face DataTrove. Large-scale data processing library.
[26] EvalPlus Leaderboard. Code generation evaluation framework.
