Lumen vs Alternatives

A technical comparison of Lumen against other browser agent frameworks. All frameworks were analyzed from source code in this repository.

At a Glance

| | Lumen | Stagehand | browser-use | Skyvern | Magnitude |
|---|---|---|---|---|---|
| Language | TypeScript | TypeScript | Python | Python | TypeScript |
| Approach | Vision-only | Hybrid (DOM + Vision + A11y) | Hybrid (DOM-first + Vision) | Vision + Playwright | Vision-only |
| Browser | CDP (raw WebSocket) | CDP + Playwright/Puppeteer | CDP (cdp-use) | Playwright | CDP |
| LLM Providers | Anthropic, Google, OpenAI, Custom | 15+ via @ai-sdk/* | 10+ (OpenAI, Anthropic, Google, Groq, Ollama...) | OpenAI, Anthropic, Google, Bedrock | Anthropic, Google |
| Context Mgmt | Dual history + 2-tier compaction | Agent cache + conversation replay | Message compaction (summarization) | Workflow-level | Screenshot compression |
| Stall Detection | 3-layer repeat detector + escalating nudges | Cache-based self-healing | Rolling window (20 actions) + page stagnation hash | Workflow retries | None documented |
| Post-Action Verification | ActionVerifier (heuristic CDP checks) | None | DefaultActionWatchdog (scroll retry) | None | None |
| Site-Specific Knowledge | SiteKB (domain-matched tips) | None | None | None | None |
| Session Resume | Full serialize/deserialize (wire + semantic + state) | Conversation messages + cache replay | Browser profile + storage state | Workflow persistence | None documented |
| Safety | Policy + PreActionHook + Verifier gates | None built-in | Domain whitelist watchdog | Auth + domain control | None documented |
| Codebase Size | ~5K LOC | ~17K LOC (published) | ~68K LOC | ~50K+ LOC | ~3K LOC (core) |
| Key Dependency | sharp, ws, provider SDKs | @ai-sdk/*, pino, zod | cdp-use, bubus, pydantic, Pillow | playwright, pydantic | Provider SDKs |

Page Understanding

The most fundamental architectural difference between these frameworks is how they understand web pages.

Lumen: Vision-Only

Lumen sends only screenshots to the model. No DOM parsing, no accessibility tree, no selectors.

Advantages:

  • Simplest implementation — no brittle DOM extraction logic
  • Works identically across all page types (SPAs, canvas, iframes, shadow DOM)
  • No risk of stale DOM snapshots
  • Model sees exactly what a human would see

Tradeoffs:

  • Requires strong vision models (Claude Sonnet 4+, Gemini 2.0+)
  • Cannot leverage structural hints (element indices, ARIA labels) for disambiguation
  • Higher per-step token cost from screenshot encoding (~40-100KB base64 each)

Stagehand: Hybrid (DOM + Vision + Accessibility)

Stagehand supports three modes: DOM-only, hybrid (vision + DOM), and CUA (computer-use). The DOM mode builds an accessibility tree from the browser's accessibility API, giving the model structured element information alongside optional screenshots.

Advantages:

  • Works with cheaper, non-vision models in DOM mode
  • Accessibility tree provides semantic structure that helps with ambiguous elements
  • Self-healing cache: if a selector breaks, re-discovers the element and replays
  • act() / extract() / observe() API is intuitive for common tasks

Tradeoffs:

  • DOM extraction adds complexity and latency
  • Accessibility tree can miss dynamically-generated or heavily-obfuscated content
  • Three separate code paths (DOM / hybrid / CUA) increase maintenance surface
  • Cannot switch modes mid-execution

browser-use: Hybrid (DOM-first + Vision fallback)

browser-use parses the DOM into indexed interactive elements ([1] <button>Click me</button>), with optional screenshots for vision models.
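To make the indexed format concrete, here is a minimal TypeScript sketch of how interactive elements could be serialized into `[index] <tag>text</tag>` lines. The `InteractiveElement` type and `serializeElements` function are hypothetical illustrations, not browser-use's actual (Python) implementation.

```typescript
// Hypothetical shape for an interactive element pulled from the DOM.
interface InteractiveElement {
  tag: string;
  text: string;
  visible: boolean;
}

// Serialize visible elements into 1-based indexed lines the model can reference.
function serializeElements(elements: InteractiveElement[]): string {
  return elements
    .filter((el) => el.visible)
    .map((el, i) => `[${i + 1}] <${el.tag}>${el.text}</${el.tag}>`)
    .join("\n");
}

const samplePage: InteractiveElement[] = [
  { tag: "button", text: "Click me", visible: true },
  { tag: "a", text: "Hidden link", visible: false },
  { tag: "input", text: "", visible: true },
];

console.log(serializeElements(samplePage));
```

In this sketch, hiding or removing one element renumbers every element after it, which is why an index the model chose on a previous step can point at the wrong element once the page changes.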

Advantages:

  • DOM indices give the model precise element targeting
  • Paint-order filtering removes occluded elements
  • Vision is optional — can run with text-only models
  • Fine-tuned ChatBrowserUse model optimized for the DOM format (3-5x speedup claimed)

Tradeoffs:

  • DOM indices shift when page content changes (stale index problem)
  • DOM serialization is expensive on large pages
  • Python-only ecosystem

Skyvern: Vision + Playwright

Skyvern uses Playwright for browser control with AI-powered commands (page.act(), page.extract(), page.validate()).

Advantages:

  • Cross-browser support (Chrome, Firefox, WebKit) via Playwright
  • Built-in workflow engine with loops, conditionals, and integrations
  • Authentication support (Bitwarden, 2FA/TOTP)
  • No-code workflow builder UI

Tradeoffs:

  • Playwright dependency is heavier than raw CDP
  • Tightly coupled to cloud service for advanced features

History & Context Management

Long-running tasks (20+ steps) can exhaust the model's context window, because every step appends more observations and actions to the prompt. How each framework handles this determines how well it scales.

Lumen: Dual History + 2-Tier Compaction

Lumen maintains two parallel histories:

  1. Wire history — fed to the model. Compressed aggressively via:
    • Tier-1: Screenshot base64 nulled out (keeps last N frames, default 2)
    • Tier-2: At 80% context utilization, LLM summarizes the entire history into a single block
  2. Semantic history — never compressed. Full screenshots, actions, outcomes, tokens, timing. Used for debugging and audit.
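The two compaction tiers can be sketched as follows. This is a minimal illustration under stated assumptions: the `WireEntry` shape, `stripOldScreenshots`, and `shouldSummarize` are hypothetical names, and only the defaults (keep the last 2 frames, summarize at 80% utilization) come from the description above.

```typescript
// Hypothetical wire-history entry; field names are assumptions for illustration.
interface WireEntry {
  role: "assistant" | "tool";
  text: string;
  screenshotBase64: string | null;
}

// Tier-1: null out screenshot payloads on all but the most recent `keepLast` frames.
// Returns new objects, so a separate (semantic) copy of the history stays intact.
function stripOldScreenshots(history: WireEntry[], keepLast = 2): WireEntry[] {
  let framesSeen = 0;
  // Walk newest-to-oldest so the most recent frames are counted first.
  return [...history]
    .reverse()
    .map((entry) => {
      if (entry.screenshotBase64 === null) return entry;
      framesSeen += 1;
      return framesSeen > keepLast ? { ...entry, screenshotBase64: null } : entry;
    })
    .reverse();
}

// Tier-2 trigger: summarize once context utilization reaches the threshold.
function shouldSummarize(usedTokens: number, contextWindow: number, threshold = 0.8): boolean {
  return usedTokens / contextWindow >= threshold;
}

const wire: WireEntry[] = ["a", "b", "c", "d"].map((frame) => ({
  role: "tool" as const,
  text: `step ${frame}`,
  screenshotBase64: frame,
}));

console.log(stripOldScreenshots(wire).map((e) => e.screenshotBase64));
console.log(shouldSummarize(85_000, 100_000));
```

Returning copies rather than mutating entries is one simple way to keep an uncompressed history alongside the compressed one.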

TaskState (writeState action) persists structured JSON that survives compaction — re-injected every step.

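Token savings grow with task length, because without compaction every earlier screenshot is re-sent on each step and cumulative input grows roughly quadratically: short tasks see ~5% reduction, while long tasks (20+ steps) routinely exceed 40%.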
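A toy model shows why the benefit increases with step count. It counts screenshot tokens only and assumes one screenshot per step with the full history re-sent each step; the function name and the 1,000-token frame cost are illustrative assumptions, so the ratios it produces are not the measured percentages quoted above.

```typescript
// Toy model: each step adds one screenshot costing `frameTokens`,
// and every step re-sends whatever frames are still in the history.
function cumulativeScreenshotTokens(
  steps: number,
  frameTokens: number,
  keepLast?: number,
): number {
  let total = 0;
  for (let step = 1; step <= steps; step++) {
    // Without a cap, step k carries k frames; with a cap, at most `keepLast`.
    const framesSent = keepLast === undefined ? step : Math.min(step, keepLast);
    total += framesSent * frameTokens;
  }
  return total;
}

for (const steps of [5, 20]) {
  const full = cumulativeScreenshotTokens(steps, 1000);
  const capped = cumulativeScreenshotTokens(steps, 1000, 2);
  console.log(`${steps} steps: full=${full}, keepLast2=${capped}`);
}
```

The uncapped total grows as n(n+1)/2 frames while the capped total grows linearly, so the gap widens as tasks get longer.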

Stagehand: Cache + Conversation Replay

Stagehand stores AgentReplayStep[] sequences that can replay without LLM calls on subsequent runs. For ongoing sessions, conversation messages (ModelMessage[]) carry the full conversation forward.

No automatic context compression — relies on the underlying model's context window (Gemini's 1M tokens helps).

browser-use: Message Compaction

Compaction triggers when history exceeds ~40K chars. A cheaper LLM summarizes old steps into memory blocks, keeping the last 6 items verbatim. The trigger thresholds are configurable.
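The threshold-plus-tail behavior described here can be sketched in TypeScript. This is not browser-use's code (which is Python): `compactHistory` and the synchronous `Summarize` callback are hypothetical, and a real summarizer would call a cheaper LLM asynchronously.

```typescript
// Hypothetical summarizer signature; stands in for a call to a cheaper LLM.
type Summarize = (oldItems: string[]) => string;

// Summarize everything except the last `keepLast` items once the history
// exceeds `maxChars`; otherwise return the history unchanged.
function compactHistory(
  items: string[],
  summarize: Summarize,
  maxChars = 40_000,
  keepLast = 6,
): string[] {
  const totalChars = items.reduce((sum, item) => sum + item.length, 0);
  if (totalChars <= maxChars || items.length <= keepLast) return items;
  const old = items.slice(0, items.length - keepLast);
  const recent = items.slice(items.length - keepLast);
  return [summarize(old), ...recent];
}

const longHistory = Array.from({ length: 10 }, (_, i) => `${i}`.padEnd(5000, "."));
const compacted = compactHistory(longHistory, (old) => `summary of ${old.length} steps`);
console.log(compacted.length, compacted[0]);
```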

Closest to Lumen's approach, but:

  • Single-tier (no separate screenshot compression)
  • No immutable audit trail (compacted history is the only history)
  • Agent state tracked via memory field in the agent state machine, not a separate persistent store

Skyvern: Workflow-Level

Skyvern manages state at the workflow level rather than the individual agent step level. Each task in a workflow runs semi-independently.


Stall & Loop Detection

A critical reliability feature: what happens when the model gets stuck repeating the same action?

Lumen: 3-Layer RepeatDetector

Three detection layers with escalating nudges:

| Layer | What it detects | Mechanism |
|---|---|---|
| Action-level | Same click/scroll repeated | SHA-256 hash of normalized action (64px coordinate bucketing) in rolling 20-action window |
| Category-level | Scroll/noop patterns interleaving | Classifies actions as productive/passive/noop; triggers when non-productive category dominates |
| URL-level | Stuck on same page too long | Tracks steps per URL (normalized to origin+pathname to ignore tracking params) |
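The action-level and URL-level mechanisms in the table can be sketched as follows. This is a minimal illustration, not Lumen's source: `PointerAction`, `hashAction`, `normalizeUrl`, and `RollingRepeatCounter` are hypothetical names, and only the 64px bucket, the 20-action window, SHA-256, and origin+pathname normalization come from the table.

```typescript
import { createHash } from "node:crypto";

// Hypothetical action shape for illustration.
interface PointerAction {
  type: "click" | "scroll";
  x: number;
  y: number;
}

const BUCKET_PX = 64;
const WINDOW_SIZE = 20;

// Bucket coordinates so near-identical clicks hash to the same value.
function hashAction(action: PointerAction): string {
  const bx = Math.floor(action.x / BUCKET_PX);
  const by = Math.floor(action.y / BUCKET_PX);
  return createHash("sha256").update(`${action.type}:${bx}:${by}`).digest("hex");
}

// Drop query string and fragment so tracking params do not look like navigation.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  return url.origin + url.pathname;
}

// Keep the last WINDOW_SIZE hashes and report how often the newest one occurs.
class RollingRepeatCounter {
  private readonly recent: string[] = [];

  record(action: PointerAction): number {
    const hash = hashAction(action);
    this.recent.push(hash);
    if (this.recent.length > WINDOW_SIZE) this.recent.shift();
    return this.recent.filter((h) => h === hash).length;
  }
}

const counter = new RollingRepeatCounter();
console.log(counter.record({ type: "click", x: 100, y: 100 }));
console.log(counter.record({ type: "click", x: 110, y: 120 })); // same 64px bucket
console.log(normalizeUrl("https://example.com/cart?utm_source=x#top"));
```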

Nudges are injected into the system prompt and escalate:

  • Level 5: "Try something different"
  • Level 8: "WARNING: Save progress, try keyboard navigation"
  • Level 12: "CRITICAL STRATEGY RESET: Change approach NOW"

Nudges are sticky — they persist until the model takes a productive action (for action nudges) or navigates away (for URL nudges).
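The escalation thresholds could be mapped like this. A minimal sketch, assuming each "level" is the repeat count reported by the detector (the text does not define it precisely); `NUDGES` and `nudgeFor` are hypothetical names, while the messages are the ones listed above.

```typescript
interface Nudge {
  level: number;
  message: string;
}

// Ordered from most to least severe so the first match wins.
const NUDGES: Nudge[] = [
  { level: 12, message: "CRITICAL STRATEGY RESET: Change approach NOW" },
  { level: 8, message: "WARNING: Save progress, try keyboard navigation" },
  { level: 5, message: "Try something different" },
];

// Return the strongest nudge whose threshold has been reached, or null.
function nudgeFor(repeatCount: number): Nudge | null {
  return NUDGES.find((nudge) => repeatCount >= nudge.level) ?? null;
}

for (const count of [4, 5, 9, 12]) {
  console.log(count, nudgeFor(count)?.message ?? "(no nudge)");
}
```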

When nudges reach level 8+, CheckpointManager can restore the browser to a previous state (URL + scroll position + agent state) instead of just nudging.

Stagehand: Cache-Based Self-Healing

If a cached action sequence fails (e.g., selector changed), Stagehand re-invokes the AI to discover the new selector. Not a loop detector per se, but prevents one class of stuck behavior.

No explicit mechanism for detecting repeated model-generated actions in non-cached execution.

browser-use: Rolling Window Detection

Similar to Lumen's action-level detection: SHA-256 hashes in a rolling window (default 20 actions) with page stagnation tracking (URL + element count + text hash). Nudges at 5, 8, 12 repetitions.

Also includes a DefaultActionWatchdog that automatically scrolls before retrying a failed action.

Skyvern: Workflow Retries

Task-level retries with configurable retry counts. No step-level stall detection within a task.


Action Execution Model

Lumen: Streaming Mid-Stream Execution

Actions are executed as soon as each tool call block completes in the model's response stream. This reduces latency compared to waiting for the full response.

Buffered outcomes are replayed into wire history after the assistant turn is committed, maintaining correct message format. Post-action delays are configurable (click: 200ms, type: 500ms, scroll: 300ms, navigation: 1000ms).
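The ordering described here (execute each tool call as it arrives, buffer the outcomes, commit them after the assistant turn) can be sketched with an async iterable. Everything in this block is a hypothetical stand-in: `fakeModelStream` replaces a real provider stream, and `runTurn` is not Lumen's actual loop.

```typescript
interface ToolCall {
  id: string;
  action: string;
}

interface Outcome {
  id: string;
  ok: boolean;
}

// Stand-in for a model response stream: yields each tool call as its block completes.
async function* fakeModelStream(calls: ToolCall[], log: string[]): AsyncGenerator<ToolCall> {
  for (const call of calls) {
    log.push(`emit:${call.id}`);
    yield call;
  }
}

// Execute each tool call as soon as it arrives, buffering outcomes until the turn ends.
async function runTurn(
  stream: AsyncIterable<ToolCall>,
  execute: (call: ToolCall) => Promise<Outcome>,
): Promise<{ assistantTurn: ToolCall[]; outcomes: Outcome[] }> {
  const assistantTurn: ToolCall[] = [];
  const buffered: Outcome[] = [];
  for await (const call of stream) {
    assistantTurn.push(call);
    buffered.push(await execute(call)); // runs before the stream has finished
  }
  // Only now, with the assistant turn complete, are outcomes handed back for history.
  return { assistantTurn, outcomes: buffered };
}

const demoLog: string[] = [];
const demoCalls: ToolCall[] = [
  { id: "a", action: "click" },
  { id: "b", action: "type" },
];

runTurn(fakeModelStream(demoCalls, demoLog), async (call) => {
  demoLog.push(`exec:${call.id}`);
  return { id: call.id, ok: true };
}).then(() => console.log(demoLog.join(" -> ")));
```

The log interleaves emits and executions (`emit:a -> exec:a -> emit:b -> exec:b`), showing that the first action runs before the second tool call has been produced.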

Form state extraction: After form-related actions, Lumen evaluates a CDP script to extract visible input values and feeds them back to the model as a nudge. This helps the model verify that form inputs were actually set correctly.

Stagehand: Tool-Calling Loop

Standard tool-calling pattern via the ai SDK's generateText(). 17 tools available, including DOM-specific ones (act, fillForm, extract).

Two execution styles:

  • High-level: stagehand.act("click the login button") — LLM figures out the selector
  • Deterministic: stagehand.act({ selector: "#btn", method: "click" }) — no LLM needed

browser-use: Registry-Based

Actions are registered in a tool registry with event emission. 18 core actions including file operations (upload_file, save_as_pdf) and tab management (switch_tab, close_tab).

14 watchdog services coordinate via an event bus for error handling, CAPTCHA detection, crash recovery, etc.

Skyvern: Playwright Actions

Wraps Playwright's action methods with AI-powered element targeting. Supports standard Playwright actions plus AI-specific ones (act, extract, validate).


Safety & Policy

Lumen

Four layers of safety, composable:

  1. SessionPolicy — declarative allow/blocklist for domains (glob patterns) and action types
  2. PreActionHook — async callback before every action, can deny with reason
  3. ActionVerifier — heuristic post-action checks (click target, input focus, goto host) via CDP, no API cost
  4. Verifier gates: UrlMatchesGate, CustomGate, and ModelVerifier verify task completion before accepting terminate

Blocked actions are fed back to the model as errors — the loop continues.
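The first two layers and the "blocked actions become errors" behavior can be sketched as below. The field names (`allowedDomains`, `blockedActions`), the hook's return shape, and `checkAction` are assumptions made for illustration; they are not confirmed to match Lumen's actual SessionPolicy or PreActionHook types.

```typescript
interface Action {
  type: string;
  url?: string;
}

// Hypothetical policy shape: domain globs plus a list of denied action types.
interface PolicySketch {
  allowedDomains: string[];
  blockedActions: string[];
}

type HookSketch = (action: Action) => Promise<{ allow: boolean; reason?: string }>;
type Verdict = { ok: true } | { ok: false; error: string };

// Convert a simple glob ("*.example.com") into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

// Run the declarative policy first, then the async hook. A denial is returned
// as an error value so the caller can feed it to the model and keep looping.
async function checkAction(action: Action, policy: PolicySketch, hook?: HookSketch): Promise<Verdict> {
  if (policy.blockedActions.includes(action.type)) {
    return { ok: false, error: `Action type "${action.type}" is blocked by policy` };
  }
  if (action.url !== undefined) {
    const host = new URL(action.url).hostname;
    if (!policy.allowedDomains.some((glob) => globToRegExp(glob).test(host))) {
      return { ok: false, error: `Domain "${host}" is not allowed by policy` };
    }
  }
  if (hook) {
    const verdict = await hook(action);
    if (!verdict.allow) return { ok: false, error: verdict.reason ?? "Denied by pre-action hook" };
  }
  return { ok: true };
}

const policy: PolicySketch = { allowedDomains: ["*.example.com"], blockedActions: ["download"] };
checkAction({ type: "goto", url: "https://evil.test/login" }, policy).then((v) => console.log(v));
```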

Stagehand

No built-in safety layer. Domain restrictions and action filtering must be implemented in application code.

browser-use

SecurityWatchdog enforces a domain whitelist. Sensitive data detection with regex-based redaction.

Skyvern

Authentication support (Bitwarden, TOTP). Domain control via workflow configuration.


Multi-Provider Support

| Provider | Lumen | Stagehand | browser-use | Skyvern |
|---|---|---|---|---|
| Anthropic (Claude) | Native CUA | Via @ai-sdk | ChatAnthropic | Direct |
| Google (Gemini) | Native CUA | Via @ai-sdk | ChatGoogle | Direct |
| OpenAI | Native CUA | Via @ai-sdk | ChatOpenAI | Direct |
| Groq | Via Custom | Via @ai-sdk | ChatGroq | No |
| Ollama | Via Custom | Via @ai-sdk | ChatOllama | Via OpenRouter |
| Custom/OpenAI-compat | CustomAdapter | Via @ai-sdk | Multiple | Via OpenRouter |
| Azure | Via Custom | Via @ai-sdk | Yes | Direct |
| AWS Bedrock | No | Via @ai-sdk | No | Direct |
| Total | 3 native + custom | 15+ | 10+ | 8+ |

Lumen focuses on deep integration with 3 providers (Anthropic, Google, and OpenAI, each through its native computer-use protocol) plus a custom fallback adapter. Stagehand and browser-use cast a wider net with more provider adapters.


Browser Integration

| | Lumen | Stagehand | browser-use | Skyvern |
|---|---|---|---|---|
| Protocol | Raw CDP WebSocket | CDP + Playwright/Puppeteer | CDP (cdp-use) | Playwright |
| Local Chrome | chrome-launcher | chrome-launcher / Patchright | chrome-launcher | Playwright chromium |
| Cloud Browser | Browserbase | Browserbase | Browser Use Cloud | Cloud offering |
| Cross-browser | Chrome only | Chrome (+ Firefox/WebKit via Playwright) | Chrome only | Chrome, Firefox, WebKit |
| Viewport Alignment | Auto-align to model patch size | No | No | No |
| Stealth/Anti-detect | No | No | Cloud mode only | Cloud mode |

Lumen's viewport alignment is unique: it snaps the viewport to the model's optimal patch size (e.g., multiples of 28px for Anthropic) to minimize rounding error in coordinate outputs.
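As a rough illustration of the snapping step, the sketch below rounds each dimension to the nearest multiple of the patch size. Rounding to nearest (rather than down) and the `alignViewport` name are assumptions; only the 28px example comes from the text.

```typescript
interface Viewport {
  width: number;
  height: number;
}

// Snap each dimension to the nearest multiple of the model's patch size,
// never going below a single patch.
function alignViewport(width: number, height: number, patchPx = 28): Viewport {
  const snap = (value: number): number =>
    Math.max(patchPx, Math.round(value / patchPx) * patchPx);
  return { width: snap(width), height: snap(height) };
}

console.log(alignViewport(1280, 720)); // { width: 1288, height: 728 }
```

With an aligned viewport, every patch boundary falls on a whole pixel, so coordinates the model reports in patch units map back to pixels without a fractional remainder.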


Observability

Lumen

  • LumenLogger: 6 surface-specific debug channels (LUMEN_LOG_CDP, LUMEN_LOG_ACTIONS, LUMEN_LOG_BROWSER, LUMEN_LOG_HISTORY, LUMEN_LOG_ADAPTER, LUMEN_LOG_LOOP)
  • LoopMonitor: Event callbacks for step/action/compaction lifecycle
  • StreamingMonitor: Typed StreamEvent async iterable for real-time UIs
  • Semantic history: Immutable audit trail with full screenshots, always available regardless of compaction

Stagehand

  • pino structured logging
  • stagehandMetrics token tracking
  • Event emitter for screenshots during execution

browser-use

  • rich terminal formatting
  • Watchdog event bus (14+ event types)
  • PostHog telemetry (opt-in)
  • HAR recording for network debugging

Skyvern

  • Livestreaming browser viewport to local machine
  • Workflow execution logs
  • Video/screen recording

Session Resumption

| | Lumen | Stagehand | browser-use | Skyvern |
|---|---|---|---|---|
| Serialize to JSON | Full roundtrip (wire + semantic + state + model ID) | Partial (conversation messages) | No | Workflow state |
| Cross-process resume | Agent.resume(snapshot, opts) | Via resumeSessionId | Browser profile reuse | Workflow continuation |
| What survives | History, agent state, model context | Cache entries, Browserbase session | Cookies, localStorage | Workflow variables |
| What doesn't | Browser state (must reconnect) | In-memory state | In-memory agent state | Browser state |

Lumen's serialization is the most complete — it captures both the compressed wire history (what the model sees) and the full semantic history (what you can audit), plus any structured state written by the model.
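A JSON roundtrip of such a snapshot might look like the sketch below. The `SnapshotSketch` shape and both helper functions are hypothetical; the real snapshot format accepted by `Agent.resume(snapshot, opts)` is not shown in this document.

```typescript
// Hypothetical snapshot shape covering the four pieces named in the table above.
interface SnapshotSketch {
  modelId: string;
  wireHistory: { role: string; text: string }[];
  semanticHistory: { step: number; action: string; screenshotBase64: string }[];
  state: Record<string, unknown>;
}

function serializeSnapshot(snapshot: SnapshotSketch): string {
  return JSON.stringify(snapshot);
}

// Parse and minimally validate before trusting data from another process.
function deserializeSnapshot(json: string): SnapshotSketch {
  const parsed = JSON.parse(json) as Partial<SnapshotSketch>;
  if (
    typeof parsed.modelId !== "string" ||
    !Array.isArray(parsed.wireHistory) ||
    !Array.isArray(parsed.semanticHistory) ||
    typeof parsed.state !== "object" ||
    parsed.state === null
  ) {
    throw new Error("Malformed session snapshot");
  }
  return parsed as SnapshotSketch;
}

const snapshot: SnapshotSketch = {
  modelId: "anthropic/claude-sonnet-4-6",
  wireHistory: [{ role: "assistant", text: "clicked login" }],
  semanticHistory: [{ step: 1, action: "click", screenshotBase64: "aGVsbG8=" }],
  state: { cartItems: 2 },
};

const restored = deserializeSnapshot(serializeSnapshot(snapshot));
console.log(restored.modelId, restored.state);
```

Browser state is deliberately absent from this shape: as the table notes, the browser itself must be reconnected on resume.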


API Design

Lumen: Facade + Session

```typescript
// High-level (recommended)
const result = await Agent.run({
  model: "anthropic/claude-sonnet-4-6",
  browser: { type: "local" },
  instruction: "...",
});

// Multi-run: reuse one agent (and browser) across instructions
const agent = new Agent({ model: "...", browser: { type: "local" } });
await agent.run({ instruction: "Step 1" });
await agent.run({ instruction: "Step 2" });

// Streaming
for await (const event of agent.stream({ instruction: "..." })) {
  // handle each StreamEvent here
}

// Low-level
const session = new Session({ tab, adapter, maxSteps: 20 });
const sessionResult = await session.run({ instruction: "..." });
```

Stagehand: Method-per-action

```typescript
const stagehand = new Stagehand(options);
await stagehand.init();
await stagehand.act("click the login button");
const data = await stagehand.extract("get the product name", schema);
const elements = await stagehand.observe("find all links");
```

browser-use: Task-based

```python
agent = Agent(task="Find the weather", llm=ChatAnthropic(...), browser=browser)
result = await agent.run(max_steps=500)
```

Skyvern: Command-based

```python
await page.act("Click the login button")
data = await page.extract("Get product name", schema={...})
is_done = await page.validate("Check if logged in")
```

Benchmark

Lumen includes a WebVoyager evaluation (evals/webvoyager/run.ts) that runs tasks across Lumen, Stagehand, and browser-use on live websites. The setup matches Stagehand's evaluation methodology: Gemini 2.5 Flash judge, same prompt, 3 trials per task, 50 max steps.

The benchmark uses a subset of 25 WebVoyager tasks, stratified across 15 sites. All frameworks use Claude Sonnet 4.6.

| Metric | Lumen | browser-use | Stagehand |
|---|---|---|---|
| Success Rate | 25/25 (100%) | 25/25 (100%) | 19/25 (76%) |
| Avg Steps (all) | 14.4 | 8.8 | 23.1 |
| Avg Steps (passed) | 14.4 | 8.8 | 15.7 |
| Avg Time (all) | 77.8s | 109.8s | 207.8s |
| Avg Time (passed) | 77.8s | 109.8s | 136.0s |
| Avg Tokens | 104K | N/A | 200K |

Lumen runs with SiteKB (domain-specific navigation tips) and ModelVerifier (termination gate) enabled.

```bash
npm run eval              # 25 tasks (default)
npm run eval -- 5         # 5 tasks
```

When to Use What

| Use case | Recommended | Why |
|---|---|---|
| Multi-step workflows (20+ steps) | Lumen | 2-tier compaction + writeState checkpointing keep context efficient for long tasks |
| Quick DOM scraping / extraction | Stagehand | extract(instruction, schema) API is purpose-built; DOM mode is fast and cheap |
| Enterprise with many integrations | browser-use | 14 watchdogs, MCP support, fine-tuned model, extensive provider support |
| No-code / workflow builder | Skyvern | Visual workflow editor, authentication support, cloud infrastructure |
| Safety-critical automation | Lumen | Policy + hooks + gates provide layered defense; full audit trail |
| Non-vision models (GPT-3.5, Llama) | Stagehand or browser-use | DOM/a11y modes work without vision capabilities |
| Multi-provider comparison | Lumen | Native adapters for Anthropic/Google/OpenAI with unified Action format |
| Python ecosystem | browser-use or Skyvern | Lumen and Stagehand are TypeScript-only |
| Maximum reliability | Lumen | Repeat detection, form state extraction, completion gates, and retry backoff cover common failure modes |
| Cross-browser testing | Skyvern | Playwright enables Firefox and WebKit alongside Chrome |

Architecture Tradeoffs Summary

| Decision | Lumen's choice | Alternative approach | Tradeoff |
|---|---|---|---|
| Page understanding | Vision-only | DOM + Vision hybrid | Simpler, universal, but needs strong vision models and costs more tokens per step |
| Coordinate space | Pixel coords (decoded per provider) | Normalized 0-1000 everywhere | Zero conversion overhead in the hot path, but requires per-provider decode logic |
| History | Dual (wire + semantic) | Single compacted history | 2x memory for screenshots, but full audit trail always available |
| Compaction | Proactive at 80% utilization | Reactive at limit | Extra LLM call cost, but avoids context pressure surprises |
| Action execution | Streaming mid-stream | Batch after full response | More complex buffer management, but lower latency |
| State persistence | writeState action (last-write-wins) | Agent memory field | Simple semantics, but no versioning or branching |
| Browser | Raw CDP | Playwright | Lighter weight, but Chrome-only |
| Safety | 4-layer (policy + hook + verifier + gates) | None / application-level | Built-in overhead, but composable defense-in-depth |