Launch notes

LeemerGLM: our Gemma 3 4B multimodal expert

Today we are opening the doors to LeemerGLM, a discovery-first model built on Gemma 3 4B and paired with a vision specialist so it can read the web, interpret screens, and answer with grounded, testable steps.

Backbone: Gemma 3 4B
Mode: Text + Vision
Context: 128K tokens
Latency: Sub-1s targets
Why Gemma 3 4B

We benchmarked frontier and open models for latency, safety, and tool-handling. Gemma 3 4B hit the sweet spot: small enough for responsive inference on our clusters, aligned enough to avoid hallucinated links, and flexible enough to fine-tune on our expert traces. It is also fully multimodal, so we can ship a single brain that reasons over pixels and tokens together.

The result is an expert that feels like an on-call teammate. Ask it to inspect a graph, critique onboarding copy, or reason through search results; LeemerGLM will combine the visuals with retrieved text and cite the sources it trusts.

Gemma 3 for stability

We picked Gemma 3 4B as the spine because it is small enough to serve instantly yet trained with frontier-scale safety, retrieval, and tool-use priors. It gives us a balanced reasoning core that stays grounded under load rather than hallucinating.

Multimodal by default

LeemerGLM ingests screenshots, product docs, graphs, and code snippets without leaving the flow. Vision is not a bolt-on; it is how the model reasons about interfaces, diagrams, and noisy real-world data.

Expert orchestration

Inside LeemerChat, LeemerGLM sits beside Grok-4.1, GPT-5.1, and Gemini 3 Pro. Our router chooses the right expert for each step: Grok for speed, Gemini for world knowledge, GPT-5.1 for longform reasoning, and LeemerGLM for grounded multimodal synthesis.

How we built it

Prototype: text-only control

We fine-tuned Gemma 3 4B on our routing traces to teach refusal patterns, citation-heavy answers, and low-latency search planning. This gave us a stable text specialist that could be trusted as the default brain for UI assistance.

Vision alignment

Next we co-trained on product screenshots, debugging traces, and Figma exports. The goal was fast layout recognition and the ability to narrate what matters on a canvas—buttons, errors, and user flows—without verbose noise.

Expert fusion

Finally we wired LeemerGLM into our expert router. It now pairs with Perplexity Sonar for retrieval, hands code to Grok-4.1-Fast, and defers deep synthesis to GPT-5.1. Each expert returns citations so the fusion layer can reconcile answers transparently.

What happens at /api/leemer-glm

The production route mirrors the marketing promise from /leemer-glm: authenticated access, transparent expert routing, and fast token streaming. We documented the flow so you can plug the endpoint straight into your product.

Authenticated by design

The /api/leemer-glm route checks for a signed-in Clerk session before work begins. That protects GPU time and keeps your conversations scoped to your account so history, preferences, and files stay private.
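A minimal sketch of that guard, assuming a Next.js route handler and Clerk's auth() helper; the body below is illustrative, not our production handler:

```typescript
// Illustrative sketch of the /api/leemer-glm guard, not the production handler.
import { auth } from "@clerk/nextjs/server";

export async function POST(req: Request) {
  // Reject unauthenticated callers before any GPU work is scheduled.
  const { userId } = await auth();
  if (!userId) {
    return new Response("Unauthorized", { status: 401 });
  }

  const payload = await req.json(); // hypothetical LeemerGLMRequest body
  // ...hand the payload to the orchestrator, scoped to userId...
  return new Response(JSON.stringify({ ok: true, userId }), {
    headers: { "Content-Type": "application/json" },
  });
}
```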

API pricing + base model

Public API access is priced at $0.10 input / $0.30 output per million tokens for leemerchat/leemer-glm, which runs on our Gemma 3 4B base so responses stay multimodal by default.
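The math is linear in tokens, so budgeting is simple; here is a small hypothetical estimator using the rates above:

```typescript
// Hypothetical cost estimator for leemerchat/leemer-glm API pricing.
const INPUT_PER_MTOK = 0.10;  // USD per million input tokens
const OUTPUT_PER_MTOK = 0.30; // USD per million output tokens

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_PER_MTOK + (outputTokens / 1e6) * OUTPUT_PER_MTOK;
}

// e.g. 2M input + 0.5M output tokens comes to $0.35
console.log(estimateCostUSD(2_000_000, 500_000).toFixed(2));
```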

Streaming router events

Responses are streamed as server-sent events. The first payload includes the router's chosen experts and the reasoning behind the selection, followed by the synthesis stream from the orchestrator.
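As a rough mental model, the stream can be typed like this; the field names below are illustrative assumptions, not a published schema:

```typescript
// Hypothetical SSE event shapes for /api/leemer-glm (field names assumed).
interface RouterEvent {
  type: "router";
  experts: string[]; // e.g. ["leemer-glm", "perplexity-sonar"]
  reasoning: string; // why the router picked these experts
}

interface TokenEvent {
  type: "token";
  delta: string; // next chunk of the synthesis stream
}

interface StreamErrorEvent {
  type: "error";
  message: string; // structured error emitted before the stream closes
}

type LeemerGLMEvent = RouterEvent | TokenEvent | StreamErrorEvent;
```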

Graceful failure modes

If the synthesis stream is unavailable, the route emits a structured error event before closing the connection. We also guard against missing OpenRouter credentials, returning a clear 503 so clients can retry elsewhere.
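On the client, that 503 makes fallback logic easy to write; a sketch, with a hypothetical backup route:

```typescript
// Hypothetical fallback: if /api/leemer-glm reports missing credentials (503),
// retry the request against a backup endpoint of your choosing.
async function callWithFallback(body: unknown): Promise<Response> {
  const res = await fetch("/api/leemer-glm", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (res.status === 503) {
    // Assumed backup route; swap in whatever expert you run as fallback.
    return fetch("/api/fallback-expert", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    });
  }
  return res;
}
```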

API flow

  1. Authenticate with Clerk so the orchestrator can link responses to your workspace.
  2. POST a LeemerGLMRequest payload to /api/leemer-glm with your prompt, vision inputs, and any tool context.
  3. Listen for the initial router event to see which experts were selected and why.
  4. Stream the synthesis body to render tokens live; if the stream ends early, show the error payload for context (see the sketch after this list).
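Putting the four steps together, here is a hedged end-to-end client sketch; it assumes the illustrative event shapes above and data: {...} SSE framing:

```typescript
// Illustrative end-to-end client for /api/leemer-glm (event shapes assumed above).
async function streamLeemerGLM(prompt: string): Promise<void> {
  const res = await fetch("/api/leemer-glm", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }), // hypothetical LeemerGLMRequest subset
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Split on SSE record boundaries; keep any partial record in the buffer.
    const records = buffer.split("\n\n");
    buffer = records.pop() ?? "";

    for (const record of records) {
      const data = record.replace(/^data: /, "").trim();
      if (!data) continue;
      const event = JSON.parse(data) as LeemerGLMEvent;

      if (event.type === "router") {
        console.log("experts:", event.experts, "because:", event.reasoning);
      } else if (event.type === "token") {
        process.stdout.write(event.delta); // render tokens live
      } else {
        console.error("stream error:", event.message); // step 4: surface context
      }
    }
  }
}
```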

When to call LeemerGLM

Use case: Product reconnaissance

Upload a dashboard screenshot and ask for the three fastest UX improvements. LeemerGLM identifies drop-off points, highlights mismatched typography, and drafts component-level fixes with Tailwind-ready code.

Use case: Data story crafting

Paste a CSV preview and a chart image. The model cross-references both, suggests a narrative arc, and drafts a one-page memo with footnotes so you can ship executive updates without editing.

Use case: Incident triage

Share a stack trace plus a screenshot of the failing page. LeemerGLM outlines probable causes, routes code suggestions to Grok, and hands back a concise runbook you can paste into PagerDuty notes.

Use case: API-first prototyping

Call the /api/leemer-glm endpoint directly from your product. The SSE stream surfaces the chosen experts and reasoning so you can log, debug, and replay responses without guessing what happened inside the orchestrator.