Leemer Heavy and Heavy (Fast) represent a new approach to AI model architecture.
Instead of monolithic models trying to do everything, we use union architectures where a single orchestrator delegates to specialists when helpful. This gives us the best of both worlds: decisive control with specialized capabilities.
The Philosophy: Why Union Models?
Traditional AI models are monolithic—they try to do everything in one pass. Leemer Heavy takes a different approach: it's a union model architecture where a single orchestrator (GPT-5 Mini) delegates to specialist models when helpful.
The orchestrator thinks independently first, then routes selectively. It doesn't just chain models together—it makes intelligent decisions about when to call research, when to do deep reasoning, and when to refine. This gives us the best of both worlds: the decisive control of a single model with the specialized capabilities of multiple experts.
In our internal evaluations across coding, analysis, and research workflows, Leemer Heavy outperformed GPT-5.1 Chat (including GPT-5.1 Chat with web search grounding) while maintaining faster response times. The key is selective delegation—not every request needs every delegate.
Leemer Heavy: Union Model Architecture
A single orchestrator that delegates to specialist models when helpful
- Orchestrator (GPT-5 Mini): primary control brain that thinks independently first, then routes selectively to delegates
- Research (Perplexity Sonar): fresh, cited knowledge
- Reasoning (GLM-4.6): stepwise analysis
- Refinement (Qwen-3-235B): braids insights
- Challenger (Grok-4-Flash): pressure testing
- Synthesis (GPT-5 Mini): integrates all delegate insights into a comprehensive, self-contained answer
Execution Flow: Iterative Orchestration
How Leemer Heavy processes a request through multiple iterations
1. Hidden planning: internal-only outline (no UI output)
2. Query planning: proposes high-signal follow-up queries
3. Iteration loop: Research → Reasoning → Refinement → Gap Analysis
4. Challenger: pressure-testing after every 5th delegate completion
5. Synthesis: integrates all insights into a comprehensive answer
Leemer Heavy: Deep Research Orchestration
Leemer Heavy is designed for comprehensive, well-grounded answers. It uses an iterative orchestration pattern that can loop up to 6 times (configurable), each iteration refining and expanding the answer.
The workflow starts with hidden planning—an internal-only outline that never appears to users. This keeps the UI clean while giving the orchestrator structure. Then comes query planning, where the system proposes high-signal follow-up queries that are deduplicated against an execution set to avoid redundant research calls.
Each iteration follows a pattern: Research (Perplexity Sonar for fresh, cited knowledge) → Reasoning (GLM-4.6 for rigorous, stepwise analysis) → Refinement (Qwen-3-235B to weave delegate insights into cohesive bridges). After each iteration, a gap analysis decides whether to continue or synthesize.
After every fifth delegate completion (or at the final iteration), a challenger (Grok-4-Flash) activates to pressure-test assumptions with alternative angles. This red-teaming ensures the answer is robust, not just comprehensive.
Finally, the synthesis model (GPT-5 Mini) receives all the sanitized delegate transcripts and produces a long, self-contained answer. The answer stands alone even if delegate outputs were removed—it integrates and expands on them, not just summarizes.
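The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual `leemer-orchestrator-core.ts` API; the type and function names are assumptions made for clarity:

```typescript
// Hypothetical sketch of Heavy's Research → Reasoning → Refinement loop
// with a gap-analysis exit check. Names are illustrative.
type DelegateResult = { role: string; text: string };

interface IterationDeps {
  research: (ctx: string) => Promise<DelegateResult>; // e.g. Perplexity Sonar
  reason: (ctx: string) => Promise<DelegateResult>;   // e.g. GLM-4.6
  refine: (ctx: string) => Promise<DelegateResult>;   // e.g. Qwen-3-235B
  hasGaps: (ctx: string) => boolean;                  // gap analysis
}

async function runHeavyLoop(
  question: string,
  deps: IterationDeps,
  maxIterations = 6, // configurable cap, as described above
): Promise<DelegateResult[]> {
  const trail: DelegateResult[] = [];
  let context = question;
  for (let i = 0; i < maxIterations; i++) {
    // One iteration: Research → Reasoning → Refinement.
    for (const step of [deps.research, deps.reason, deps.refine]) {
      const result = await step(context);
      trail.push(result);
      context += `\n${result.text}`;
    }
    // Gap analysis decides whether to continue or hand off to synthesis.
    if (!deps.hasGaps(context)) break;
  }
  return trail;
}
```

The delegate trail returned here is what the synthesis step would consume; the real system additionally handles streaming, sanitization, and the challenger cadence.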
Leemer Heavy (Fast): Debate Synthesis System
Two models exchange ideas before synthesis for rapid, multi-perspective answers
1. Proponent presents initial thoughts
2. Challenger counter-argues
3. Proponent refines position
4. Challenger delivers the final counterpoint
5. Synthesis integrates the best insights from both sides
Leemer Heavy (Fast): Rapid Debate Synthesis
Heavy (Fast) takes a completely different approach. Instead of iterative research loops, it uses a structured debate system where two models exchange ideas before synthesis.
The debate follows a fixed 4-stage pattern: Proponent (Gemini 2.5 Flash Lite) presents initial thoughts → Challenger (Qwen3 Next 80B) counter-argues → Proponent refines position → Challenger provides final counterpoint. Optionally, a pre-debate research snapshot can provide context, but the focus is on rapid multi-perspective exploration.
After the debate, a synthesis model (Kimi Linear 48B) integrates the best insights from both sides into a comprehensive answer. The synthesis doesn't just summarize—it builds upon the debate to create a richer, more complete answer than either side alone.
This approach is faster (30-90 seconds versus 2-4 minutes) and uses fewer delegate calls (a fixed 4-5 versus up to 20+), making it ideal for questions that benefit from multiple perspectives but don't require deep research iteration.
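The fixed debate order can be sketched as below. Again, this is an illustrative outline under assumed names, not the production flow:

```typescript
// Hypothetical sketch of the 4-stage debate plus synthesis described above.
type Turn = { speaker: "proponent" | "challenger"; text: string };
type Model = (prompt: string) => Promise<string>;

async function runDebate(
  question: string,
  proponent: Model,   // e.g. Gemini 2.5 Flash Lite
  challenger: Model,  // e.g. Qwen3 Next 80B
  synthesizer: Model, // e.g. Kimi Linear 48B
): Promise<{ turns: Turn[]; answer: string }> {
  const turns: Turn[] = [];
  // Fixed order: proponent opens, challenger rebuts,
  // proponent refines, challenger closes with the final counterpoint.
  const order: Array<["proponent" | "challenger", Model]> = [
    ["proponent", proponent],
    ["challenger", challenger],
    ["proponent", proponent],
    ["challenger", challenger],
  ];
  let transcript = question;
  for (const [speaker, model] of order) {
    const text = await model(transcript);
    turns.push({ speaker, text });
    transcript += `\n${speaker}: ${text}`;
  }
  // Synthesis builds on the full transcript rather than summarizing it.
  const answer = await synthesizer(transcript);
  return { turns, answer };
}
```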
Heavy vs Heavy (Fast): When to Use Each
Understanding the trade-offs between depth and speed
| Feature | Leemer Heavy | Heavy (Fast) |
|---|---|---|
| Primary Approach | Iterative research orchestration | Rapid debate synthesis |
| Orchestrator | GPT-5 Mini | Direct debate flow |
| Research | Perplexity Sonar (iterative) | Optional pre-debate snapshot |
| Reasoning | GLM-4.6 (deep analysis) | Built into debate stages |
| Synthesis Model | GPT-5 Mini | Kimi Linear 48B |
| Max Iterations | 6 (configurable) | Fixed 4-stage debate |
| Best For | Deep research, comprehensive analysis | Quick multi-perspective answers |
| Time Budget | 270 seconds | 180 seconds |
The Technical Stack
Both models share a common orchestration core (`leemer-orchestrator-core.ts`) that provides timeout management, streaming control, telemetry recording, and delegate registry. This shared foundation ensures consistency and makes it easy to add new delegates or modify workflows.
The delegate registry pattern allows each specialist to be a pure async handler that accepts arguments and a delegate context. Delegates emit timing metadata in their tool events, which the UI uses to display start times and durations.
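The registry pattern described above might look something like this. The interfaces are assumptions for illustration, not the shared core's real types:

```typescript
// Hypothetical sketch of a delegate registry where each specialist is a
// pure async handler taking arguments and a delegate context.
interface DelegateContext {
  signal: AbortSignal;           // cancellation from shared timeout management
  emit: (event: object) => void; // tool events carrying timing metadata
}

type DelegateHandler = (
  args: Record<string, unknown>,
  ctx: DelegateContext,
) => Promise<string>;

const registry = new Map<string, DelegateHandler>();

function registerDelegate(name: string, handler: DelegateHandler): void {
  registry.set(name, handler);
}

async function callDelegate(
  name: string,
  args: Record<string, unknown>,
  ctx: DelegateContext,
): Promise<string> {
  const handler = registry.get(name);
  if (!handler) throw new Error(`Unknown delegate: ${name}`);
  const startedAt = Date.now();
  const text = await handler(args, ctx);
  // Timing metadata lets the UI display start times and durations.
  ctx.emit({ delegate: name, startedAt, durationMs: Date.now() - startedAt });
  return text;
}
```

Because handlers are pure async functions, new delegates can be registered without touching the orchestration loop.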
Context management is critical. The system maintains rolling context windows, sanitizes delegate text to remove tool-signature tokens, and deduplicates queries. Continuation heuristics detect truncated answers and automatically patch them using the delegate trail.
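Query deduplication against the executed set could work along these lines; the normalization rule here is an assumption, not the production logic:

```typescript
// Illustrative sketch: filter proposed follow-up queries against a set of
// already-executed queries to avoid redundant research calls.
function planQueries(proposed: string[], executed: Set<string>): string[] {
  const fresh: string[] = [];
  for (const query of proposed) {
    const key = query.trim().toLowerCase(); // normalize before comparing
    if (executed.has(key)) continue;        // skip redundant research calls
    executed.add(key);
    fresh.push(query);
  }
  return fresh;
}
```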
All configuration is centralized in `leemer-config.ts`, which loads model IDs, timeouts, feature flags, and budgets from environment variables. This makes it easy to tune performance or swap models without changing core logic.
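A centralized config loader of this kind might be sketched as follows. The environment-variable names and defaults below are hypothetical, not the real contents of `leemer-config.ts`:

```typescript
// Hypothetical sketch of environment-driven configuration loading.
type Env = Record<string, string | undefined>;

interface LeemerConfig {
  synthesisModel: string;
  maxIterations: number;
  heavyBudgetMs: number;
}

function loadConfig(env: Env): LeemerConfig {
  return {
    // Swap models or tune budgets via environment, without core-logic changes.
    synthesisModel: env.LEEMER_SYNTHESIS_MODEL ?? "gpt-5-mini",
    maxIterations: Number(env.LEEMER_MAX_ITERATIONS ?? 6),
    heavyBudgetMs: Number(env.LEEMER_HEAVY_BUDGET_MS ?? 270_000),
  };
}
```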
Why This Architecture Works
The union model approach solves a fundamental problem: no single model is best at everything. GPT-5 Mini is excellent at orchestration and synthesis, but Perplexity Sonar has better web search capabilities. GLM-4.6 excels at stepwise reasoning, while Qwen-3-235B is better at long-form refinement.
By letting the orchestrator decide when to delegate, we get the right tool for the right job. Simple questions might only need synthesis. Complex research questions trigger the full iterative loop. Questions needing multiple perspectives use the debate system.
The architecture is also extensible. New delegates can be added to the registry without changing core orchestration logic. Feature flags control optional behaviors (like debate injection in Heavy or pre-debate research in Fast). Timeout budgets prevent runaway costs while allowing flexibility.
Most importantly, the system is transparent. Users can see delegate activity in the UI (collapsed by default), and telemetry provides full observability. This transparency builds trust and helps debug issues when they arise.
Union model architectures represent a new paradigm in AI: instead of building bigger monolithic models, we're building smarter orchestrators that know when to delegate. This approach is more efficient, more transparent, and more capable than any single model alone.
— Repath Khan
Founder, LeemerChat