Leemer Heavy and Heavy (Fast) represent a new approach to AI model architecture.
Instead of monolithic models trying to do everything, we use union architectures where a single orchestrator delegates to specialists when helpful. This gives us the best of both worlds: decisive control with specialized capabilities.
The Philosophy: Why Union Models?
Traditional AI models are monolithic—they try to do everything in one pass. Leemer Heavy takes a different approach: it's a union model architecture where a single orchestrator (GPT-5 Mini) delegates to specialist models when helpful.
The orchestrator thinks independently first, then routes selectively. It doesn't just chain models together—it makes intelligent decisions about when to call research, when to do deep reasoning, and when to refine. This gives us the best of both worlds: the decisive control of a single model with the specialized capabilities of multiple experts.
In our internal evaluations across coding, analysis, and research workflows, Leemer Heavy outperformed GPT-5.1 Chat (including GPT-5.1 Chat with web search grounding) while maintaining faster response times. The key is selective delegation—not every request needs every delegate.
Leemer Heavy: Union Model Architecture
A single orchestrator that delegates to specialist models when helpful
- Orchestrator (GPT-5 Mini): primary control brain that thinks independently first, then routes selectively to delegates
- Research (Perplexity Sonar): fresh, cited knowledge
- Reasoning (GLM-4.6): stepwise analysis
- Refinement (Qwen-3-235B): braids insights
- Challenger (Grok-4-Flash): pressure testing
- Synthesis (GPT-5 Mini): integrates all delegate insights into a comprehensive, self-contained answer
Execution Flow: Iterative Orchestration
How Leemer Heavy processes a request through multiple iterations
1. Hidden planning: internal-only outline (no UI output)
2. Query planning: proposes high-signal follow-up queries
3. Iteration loop: Research → Reasoning → Refinement → Gap Analysis
4. Challenger: pressure-testing after every 5th delegate completion
5. Synthesis: integrates all insights into a comprehensive answer
Leemer Heavy: Deep Research Orchestration
Leemer Heavy is designed for comprehensive, well-grounded answers. It uses an iterative orchestration pattern that can loop up to 6 times (configurable), each iteration refining and expanding the answer.
The workflow starts with hidden planning—an internal-only outline that never appears to users. This keeps the UI clean while giving the orchestrator structure. Then comes query planning, where the system proposes high-signal follow-up queries that are deduplicated against an execution set to avoid redundant research calls.
Each iteration follows a pattern: Research (Perplexity Sonar for fresh, cited knowledge) → Reasoning (GLM-4.6 for rigorous, stepwise analysis) → Refinement (Qwen-3-235B to weave delegate insights into cohesive bridges). After each iteration, a gap analysis decides whether to continue or synthesize.
After every fifth delegate completion (or at the final iteration), a challenger (Grok-4-Flash) activates to pressure-test assumptions with alternative angles. This red-teaming ensures the answer is robust, not just comprehensive.
Finally, the synthesis model (GPT-5 Mini) receives all the sanitized delegate transcripts and produces a long, self-contained answer. The answer stands alone even if delegate outputs were removed—it integrates and expands on them, not just summarizes.
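The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual `leemer-orchestrator-core.ts` API; the type and function names are assumptions made for clarity:

```typescript
// Hypothetical sketch of Heavy's Research → Reasoning → Refinement loop
// with a gap-analysis exit check. Names are illustrative.
type DelegateResult = { role: string; text: string };

interface IterationDeps {
  research: (ctx: string) => Promise<DelegateResult>; // e.g. Perplexity Sonar
  reason: (ctx: string) => Promise<DelegateResult>;   // e.g. GLM-4.6
  refine: (ctx: string) => Promise<DelegateResult>;   // e.g. Qwen-3-235B
  hasGaps: (ctx: string) => boolean;                  // gap analysis
}

async function runHeavyLoop(
  question: string,
  deps: IterationDeps,
  maxIterations = 6, // configurable cap, as described above
): Promise<DelegateResult[]> {
  const trail: DelegateResult[] = [];
  let context = question;
  for (let i = 0; i < maxIterations; i++) {
    // One iteration: Research → Reasoning → Refinement.
    for (const step of [deps.research, deps.reason, deps.refine]) {
      const result = await step(context);
      trail.push(result);
      context += `\n${result.text}`;
    }
    // Gap analysis decides whether to continue or hand off to synthesis.
    if (!deps.hasGaps(context)) break;
  }
  return trail;
}
```

The delegate trail returned here is what the synthesis step would consume; the real system additionally handles streaming, sanitization, and the challenger cadence.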
Leemer Heavy (Fast): Debate Synthesis System
Two models exchange ideas before synthesis for rapid, multi-perspective answers
1. Proponent presents initial thoughts
2. Challenger counter-argues
3. Proponent refines position
4. Challenger delivers the final counterpoint
5. Synthesis integrates the best insights from both sides
Leemer Heavy (Fast): Rapid Debate Synthesis
Heavy (Fast) takes a completely different approach. Instead of iterative research loops, it uses a structured debate system where two models exchange ideas before synthesis.
The debate follows a fixed 4-stage pattern: Proponent (Gemini 2.5 Flash Lite) presents initial thoughts → Challenger (Qwen3 Next 80B) counter-argues → Proponent refines position → Challenger provides final counterpoint. Optionally, a pre-debate research snapshot can provide context, but the focus is on rapid multi-perspective exploration.
After the debate, a synthesis model (Kimi Linear 48B) integrates the best insights from both sides into a comprehensive answer. The synthesis doesn't just summarize—it builds upon the debate to create a richer, more complete answer than either side alone.
This approach is faster (30-90 seconds versus 2-4 minutes) and uses fewer delegate calls (a fixed 4-5 versus up to 20+), making it ideal for questions that benefit from multiple perspectives but don't require deep research iteration.
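The fixed debate order can be sketched as below. Again, this is an illustrative outline under assumed names, not the production flow:

```typescript
// Hypothetical sketch of the 4-stage debate plus synthesis described above.
type Turn = { speaker: "proponent" | "challenger"; text: string };
type Model = (prompt: string) => Promise<string>;

async function runDebate(
  question: string,
  proponent: Model,   // e.g. Gemini 2.5 Flash Lite
  challenger: Model,  // e.g. Qwen3 Next 80B
  synthesizer: Model, // e.g. Kimi Linear 48B
): Promise<{ turns: Turn[]; answer: string }> {
  const turns: Turn[] = [];
  // Fixed order: proponent opens, challenger rebuts,
  // proponent refines, challenger closes with the final counterpoint.
  const order: Array<["proponent" | "challenger", Model]> = [
    ["proponent", proponent],
    ["challenger", challenger],
    ["proponent", proponent],
    ["challenger", challenger],
  ];
  let transcript = question;
  for (const [speaker, model] of order) {
    const text = await model(transcript);
    turns.push({ speaker, text });
    transcript += `\n${speaker}: ${text}`;
  }
  // Synthesis builds on the full transcript rather than summarizing it.
  const answer = await synthesizer(transcript);
  return { turns, answer };
}
```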
Heavy vs Heavy (Fast): When to Use Each
Understanding the trade-offs between depth and speed
| Feature | Leemer Heavy | Heavy (Fast) |
|---|---|---|
| Primary Approach | Iterative research orchestration | Rapid debate synthesis |
| Orchestrator | GPT-5 Mini | Direct debate flow |
| Research | Perplexity Sonar (iterative) | Optional pre-debate snapshot |
| Reasoning | GLM-4.6 (deep analysis) | Built into debate stages |
| Synthesis Model | GPT-5 Mini | Kimi Linear 48B |
| Max Iterations | 6 (configurable) | Fixed 4-stage debate |
| Best For | Deep research, comprehensive analysis | Quick multi-perspective answers |
| Time Budget | 270 seconds | 180 seconds |
The Technical Stack
Both models share a common orchestration core (`leemer-orchestrator-core.ts`) that provides timeout management, streaming control, telemetry recording, and delegate registry. This shared foundation ensures consistency and makes it easy to add new delegates or modify workflows.
The delegate registry pattern allows each specialist to be a pure async handler that accepts arguments and a delegate context. Delegates emit timing metadata in their tool events, which the UI uses to display start times and durations.
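The registry pattern described above might look something like this. The interfaces are assumptions for illustration, not the shared core's real types:

```typescript
// Hypothetical sketch of a delegate registry where each specialist is a
// pure async handler taking arguments and a delegate context.
interface DelegateContext {
  signal: AbortSignal;           // cancellation from shared timeout management
  emit: (event: object) => void; // tool events carrying timing metadata
}

type DelegateHandler = (
  args: Record<string, unknown>,
  ctx: DelegateContext,
) => Promise<string>;

const registry = new Map<string, DelegateHandler>();

function registerDelegate(name: string, handler: DelegateHandler): void {
  registry.set(name, handler);
}

async function callDelegate(
  name: string,
  args: Record<string, unknown>,
  ctx: DelegateContext,
): Promise<string> {
  const handler = registry.get(name);
  if (!handler) throw new Error(`Unknown delegate: ${name}`);
  const startedAt = Date.now();
  const text = await handler(args, ctx);
  // Timing metadata lets the UI display start times and durations.
  ctx.emit({ delegate: name, startedAt, durationMs: Date.now() - startedAt });
  return text;
}
```

Because handlers are pure async functions, new delegates can be registered without touching the orchestration loop.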
Context management is critical. The system maintains rolling context windows, sanitizes delegate text to remove tool-signature tokens, and deduplicates queries. Continuation heuristics detect truncated answers and automatically patch them using the delegate trail.
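Query deduplication against the executed set could work along these lines; the normalization rule here is an assumption, not the production logic:

```typescript
// Illustrative sketch: filter proposed follow-up queries against a set of
// already-executed queries to avoid redundant research calls.
function planQueries(proposed: string[], executed: Set<string>): string[] {
  const fresh: string[] = [];
  for (const query of proposed) {
    const key = query.trim().toLowerCase(); // normalize before comparing
    if (executed.has(key)) continue;        // skip redundant research calls
    executed.add(key);
    fresh.push(query);
  }
  return fresh;
}
```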
All configuration is centralized in `leemer-config.ts`, which loads model IDs, timeouts, feature flags, and budgets from environment variables. This makes it easy to tune performance or swap models without changing core logic.
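A centralized config loader of this kind might be sketched as follows. The environment-variable names and defaults below are hypothetical, not the real contents of `leemer-config.ts`:

```typescript
// Hypothetical sketch of environment-driven configuration loading.
type Env = Record<string, string | undefined>;

interface LeemerConfig {
  synthesisModel: string;
  maxIterations: number;
  heavyBudgetMs: number;
}

function loadConfig(env: Env): LeemerConfig {
  return {
    // Swap models or tune budgets via environment, without core-logic changes.
    synthesisModel: env.LEEMER_SYNTHESIS_MODEL ?? "gpt-5-mini",
    maxIterations: Number(env.LEEMER_MAX_ITERATIONS ?? 6),
    heavyBudgetMs: Number(env.LEEMER_HEAVY_BUDGET_MS ?? 270_000),
  };
}
```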
Why This Architecture Works
The union model approach solves a fundamental problem: no single model is best at everything. GPT-5 Mini is excellent at orchestration and synthesis, but Perplexity Sonar has better web search capabilities. GLM-4.6 excels at stepwise reasoning, while Qwen-3-235B is better at long-form refinement.
By letting the orchestrator decide when to delegate, we get the right tool for the right job. Simple questions might only need synthesis. Complex research questions trigger the full iterative loop. Questions needing multiple perspectives use the debate system.
The architecture is also extensible. New delegates can be added to the registry without changing core orchestration logic. Feature flags control optional behaviors (like debate injection in Heavy or pre-debate research in Fast). Timeout budgets prevent runaway costs while allowing flexibility.
Most importantly, the system is transparent. Users can see delegate activity in the UI (collapsed by default), and telemetry provides full observability. This transparency builds trust and helps debug issues when they arise.
Union model architectures represent a new paradigm in AI: instead of building bigger monolithic models, we're building smarter orchestrators that know when to delegate. This approach is more efficient, more transparent, and more capable than any single model alone.
— Repath Khan
Founder, LeemerChat