Large Codebase Optimization Guide

RE-cue Python version includes comprehensive performance optimizations designed for analyzing large codebases with 1000+ files. This guide explains the optimization features and how to use them effectively.

Overview

When analyzing large enterprise codebases, performance becomes critical. RE-cue addresses this with:

Parallel Processing: Concurrent file analysis using multiprocessing
Incremental Analysis: Skip unchanged files on re-analysis
Memory Efficiency: Safe handling of large files
Progress Reporting: Live feedback during long-running analysis
Early Termination: Graceful handling of errors and interruptions

Performance Features

1. Parallel File Processing

Analyzes multiple files concurrently using Python’s ProcessPoolExecutor.

Benefits:

Faster analysis on multi-core systems
Automatic worker count optimization
Scales with available CPU cores

Configuration:

# Use default parallel processing (enabled by default)
re-cue --spec --path ~/large-project

# Specify worker count explicitly
re-cue --spec --max-workers 8 --path ~/large-project

# Disable for debugging (sequential processing)
re-cue --spec --no-parallel --path ~/large-project

Performance Characteristics:

Automatically uses optimal worker count (CPU cores)
Threshold: Only activates for 10+ files
Overhead: Minimal for small projects, significant speedup for large ones

2. Incremental Analysis

Tracks file metadata and skips unchanged files on repeated analysis.

Benefits:

5-6x speedup on re-analysis of unchanged files
Automatic change detection
Persistent state across runs

How It Works:

First run: Analyzes all files, stores metadata (size, mtime)
Subsequent runs: Compares current metadata with stored state
Only processes files that have changed
Updates metadata for processed files

Configuration:

# Enable incremental (default)
re-cue --spec --path ~/large-project

# Force full re-analysis
re-cue --spec --no-incremental --path ~/large-project

# Check what changed
re-cue --spec --verbose --path ~/large-project

State Management:

State file: specs/001-reverse/.file_tracker_state.json
Contains: File paths, sizes, modification times
Persistent across analysis sessions

Use Cases:

Continuous Integration: Fast re-analysis after code changes
Iterative Development: Quick updates during active development
Documentation Maintenance: Keep docs in sync with minimal overhead

3. Memory Efficient File Reading

Safely handles large files without exhausting memory.

Features:

File size limits (default: 10MB per file)
Stream-based reading with error recovery
Prevents crashes on oversized files

Configuration: Files exceeding the size limit are logged and skipped automatically.

Example:

# In optimized_analyzer.py
from reverse_engineer.optimization import read_file_efficiently

content = read_file_efficiently(file_path, max_size_mb=10)

4. Progress Reporting

Real-time feedback during analysis with progress bars and ETA.

Features:

Live progress bars with percentage
Estimated time remaining (ETA)
Error tracking and summary
Configurable verbosity

Configuration:

# Verbose mode (detailed progress)
re-cue --spec --verbose --path ~/large-project

# Quiet mode (minimal output)
re-cue --spec --path ~/large-project

Output Example:

Analyzing controllers: [████████████████░░░░] 75.0% (150/200) ETA: 12s

5. Early Termination & Error Handling

Graceful handling of errors and interruptions.

Features:

Configurable error thresholds (default: 10 max errors)
Signal handlers for Ctrl+C (SIGINT) and SIGTERM
Clean worker process shutdown
Error summary reporting

Configuration:

# In optimized_analyzer.py
processor = ParallelProcessor(
    max_workers=4,
    max_errors=10,  # Stop after 10 errors
    verbose=True
)

Error Handling:

Individual file errors don’t stop entire analysis
After max errors reached, analysis stops gracefully
All workers are cleaned up properly
Error summary displayed at end

Performance Benchmarks

Test Environment

Project: Spring Boot application
Files: 225 total (50 controllers, 100 models, 75 services)
System: Standard CI environment

Results

Scenario	Time	Speedup
First analysis (all files)	0.023s	Baseline
Re-analysis (unchanged, incremental)	0.004s	5.96x
Re-analysis (no incremental)	0.023s	1.0x
Parallel (50 controllers)	0.008s	N/A
Sequential (50 controllers)	0.010s	N/A

Key Insights:

Incremental analysis provides dramatic speedup for unchanged files
Parallel processing shows benefit at scale (1000+ files)
For small projects (<100 files), overhead may outweigh benefits
Best performance: Combine incremental + parallel for large projects

Scaling Characteristics

File Count	Sequential Time	Parallel Time (4 workers)	Speedup
10 files	~0.002s	~0.003s	0.67x (overhead)
50 files	~0.010s	~0.008s	1.25x
200 files	~0.040s	~0.015s	2.67x
1000 files	~0.200s	~0.060s	3.33x

Best Practices

For Large Codebases (1000+ files)

# Recommended configuration
re-cue --spec --plan \
  --verbose \
  --max-workers 8 \
  --path ~/enterprise-app

Why:

Verbose mode: Shows progress for long-running analysis
8 workers: Good balance for most systems
Incremental: Enabled by default, saves time on re-runs

For Continuous Integration

# CI environment
re-cue --spec \
  --incremental \
  --max-workers 4 \
  --path $CI_PROJECT_DIR

Why:

Incremental: Leverages cached state from previous runs
4 workers: Conservative for shared CI runners
Parallel enabled: Faster on first run

For Development/Iteration

# During active development
re-cue --spec \
  --verbose \
  --path ~/my-project

Why:

Default optimizations: Best performance
Verbose: See what changed and was re-analyzed
Incremental: Fast updates after code changes

For Debugging

# Debug mode
re-cue --spec \
  --no-parallel \
  --no-incremental \
  --verbose \
  --path ~/problematic-project

Why:

Sequential: Easier to debug errors
No incremental: Ensure full re-analysis
Verbose: Maximum diagnostic output

Troubleshooting

Slow Analysis

Problem: Analysis takes too long

Solutions:

# Check if parallel is enabled
re-cue --spec --verbose  # Look for "Using optimized processing"

# Increase workers
re-cue --spec --max-workers 16

# Verify incremental is working
re-cue --spec --verbose  # Look for "Skipping N unchanged files"

High Memory Usage

Problem: Process uses too much memory

Solutions:

# Reduce worker count
re-cue --spec --max-workers 2

# Check for very large files
re-cue --spec --verbose  # Look for "File too large" warnings

Stale Results

Problem: Changes not reflected in output

Solutions:

# Force full re-analysis
re-cue --spec --no-incremental

# Clear state and re-run
rm -f specs/001-reverse/.file_tracker_state.json
re-cue --spec

Errors During Parallel Processing

Problem: Errors only occur with parallel processing

Solutions:

# Use sequential for debugging
re-cue --spec --no-parallel --verbose

# Check error summary at end of run
# Increase error threshold if needed (code modification)

Advanced Configuration

Custom Worker Count

Determine optimal worker count:

from reverse_engineer.optimization import get_optimal_worker_count

# For your file count
optimal = get_optimal_worker_count(file_count=1500)
print(f"Recommended workers: {optimal}")

Programmatic Usage

Use optimizations in Python code:

from pathlib import Path
from reverse_engineer.analyzer import ProjectAnalyzer

analyzer = ProjectAnalyzer(
    repo_root=Path("~/large-project"),
    verbose=True,
    enable_optimizations=True,
    enable_incremental=True,
    max_workers=8
)

analyzer.analyze()

File Tracker State

Inspect or modify state:

from pathlib import Path
from reverse_engineer.optimization import FileTracker

tracker = FileTracker(Path("specs/001-reverse/.file_tracker_state.json"))

# Check if file changed
changed = tracker.has_changed(Path("src/Controller.java"))

# Manually update file
tracker.update_file(Path("src/Controller.java"))
tracker.save_state()

Future Enhancements

Planned optimizations for future releases:

Result Caching: Cache analysis results, not just file metadata
Distributed Processing: Analyze across multiple machines
Smart Prioritization: Analyze critical files first
Compression: Compress state files for large projects
Index Building: Pre-build indexes for faster queries

Summary

RE-cue’s optimization features make it suitable for enterprise-scale codebases:

Feature	Benefit	Impact
Parallel Processing	Faster analysis	2-3x speedup for 1000+ files
Incremental Analysis	Skip unchanged files	5-6x speedup on re-runs
Memory Efficiency	Handle large files	Prevents crashes
Progress Reporting	Better UX	Real-time feedback
Error Handling	Robust analysis	Graceful degradation

Bottom Line: For projects with 1000+ files, use RE-cue Python version with default optimizations for best performance.

For questions or issues, see the main TROUBLESHOOTING.md guide.

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.