Running Claude on Your Laptop: A Wardley Map of When Local LLMs Will Replace the API

February 11, 2026 · Dan Gurgui

The renter problem: why cloud LLMs feel inevitable (until they don’t)

If you work with AI in any serious capacity, you’re probably sending requests to an API. Claude, GPT, Gemini. You paste in your context, you get a response, and you pay per token. Or you use Claude Code in your terminal and it burns through your API quota while editing files, running tests, and reasoning through your codebase.

This works. It works well. But you’re also paying $5-20 per day for heavy sessions. Teams doing serious agentic work report $500+ monthly per developer. Every file, every prompt, every architectural decision passes through someone else’s infrastructure.

The dependency runs deeper than cost. According to a recent analysis of Claude Desktop incidents, Anthropic’s own desktop client experienced 19 incidents in 14 days during late January 2026, including a memory leak shipped to production. When Anthropic has a bad day, your workflow stops. No internet, no AI assistant. Plane, train, or bad hotel Wi-Fi? You’re back to vanilla VS Code.

We’re renters. And like renting, it’s convenient until you realize how much of your workflow depends on someone else’s infrastructure.


Define the bar: what “Claude-quality coding agent” actually means

What I actually want is simple to describe and hard to deliver: I want to run a Claude-quality coding agent on my laptop, fully offline, with no API keys, no token costs, and no data leaving my machine.

Not a toy. Not a chatbot that sort of writes code. I want the full experience: agentic coding where it reads my repo, edits files, runs tests, and iterates on errors. Tool use with terminal commands, file operations, and web search when online. Long context where I can feed it my entire project and it holds coherence. Reasoning quality where it doesn’t just autocomplete but actually thinks through architecture decisions.

Before we can talk about closing the gap, we need to define the target. Claude Opus 4.6, released February 2026, is the current frontier for agentic coding: 80.8% on SWE-bench Verified, ~40 tok/s output speed, 200K+ tokens with coherence, and multi-step tool chains with error recovery.

| Model | SWE-bench Verified | Parameters | Can run locally? |
|---|---|---|---|
| Claude Opus 4.6 | 80.8% | Unknown (cloud) | No |
| GPT-5.2 | 80.0% | Unknown (cloud) | No |
| Kimi 2.5 | 76.8% | 1T MoE (32B active) | Marginal (needs ~64GB+) |
| Llama 3.3 70B | ~65% | 70B | Yes (128GB, 4-bit) |
| DeepSeek Coder 33B | ~63% | 33B | Yes (32GB, 4-bit) |

The frontier models are converging in the 80% range on SWE-bench. That’s the bar local models need to reach. The real gap isn’t 76.8% vs 80.8%. It’s that most models you’d actually run locally land at 63-65%.


The Wardley Map: the value chain from developer need to hardware

A Wardley Map helps us see the landscape clearly. It plots components along two axes: visibility (how close the component is to the user’s actual need) and evolution (how mature the component is, from genesis to commodity).

The map covers every component in the “running a Claude-quality coding agent locally” value chain. Components in red are critical gaps. Orange components are evolving. Green ones are already mature. Blue boxes show where each component is projected to be by late 2026 or mid 2027.

Here’s what the value chain looks like from top to bottom:

User-facing (high visibility):

  • Cloud Claude API: far right, commodity service
  • Local coding agent experience: the goal we’re mapping toward

Model quality layer (red zone):

  • Reasoning quality: 15-point gap to close
  • Tool use reliability: multi-step chains break down
  • Long context: claimed vs actual effectiveness

Performance layer (orange zone):

  • Inference speed: 10 tok/s vs 40+ tok/s
  • Fine-tuning/personalization: still genesis-stage

Infrastructure layer (green zone):

  • Inference engines (Ollama, llama.cpp, MLX): mature
  • Open model weights: abundant
  • Quantization tools: commoditized

Hardware foundation:

  • LLM inference box: emerging category
  • Laptop hardware: M4 Max today, M5 coming
  • Unified memory: 128GB ceiling, 256GB needed

The map tells us where to focus. The red zone is where the experience breaks.
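
To make the map concrete, here is a minimal sketch of the same value chain expressed as data. The visibility and evolution coordinates are illustrative placeholders I've assigned for discussion, not values taken from the original map.

```python
# Illustrative sketch of the value chain as Wardley-map coordinates.
# visibility: 0.0 (invisible plumbing) -> 1.0 (what the user directly experiences)
# evolution:  0.0 (genesis)            -> 1.0 (commodity)
# The numbers are rough placeholders, not measurements.
components = [
    {"name": "Local coding agent experience",  "visibility": 0.95, "evolution": 0.30},
    {"name": "Cloud Claude API",               "visibility": 0.95, "evolution": 0.90},
    {"name": "Reasoning quality",              "visibility": 0.75, "evolution": 0.35},
    {"name": "Tool use reliability",           "visibility": 0.75, "evolution": 0.30},
    {"name": "Long context",                   "visibility": 0.70, "evolution": 0.35},
    {"name": "Inference speed",                "visibility": 0.55, "evolution": 0.50},
    {"name": "Fine-tuning / personalization",  "visibility": 0.50, "evolution": 0.15},
    {"name": "Inference engines (Ollama, llama.cpp, MLX)", "visibility": 0.35, "evolution": 0.75},
    {"name": "Open model weights",             "visibility": 0.35, "evolution": 0.80},
    {"name": "Quantization tools",             "visibility": 0.30, "evolution": 0.85},
    {"name": "Unified memory",                 "visibility": 0.15, "evolution": 0.55},
]

# The "red zone": components the user feels directly that are still far from commodity.
red_zone = [c["name"] for c in components if c["visibility"] >= 0.70 and c["evolution"] < 0.50]
print(red_zone)  # reasoning quality, tool use reliability, long context
```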


Critical gaps (today): quality, tool reliability, long-context that actually works

The reasoning gap

The best open model for coding is Kimi 2.5 from Moonshot AI, hitting 76.8% on SWE-bench Verified. It’s a 1 trillion parameter MoE model that activates 32B parameters per token. Its “Agent Swarm” technology coordinates up to 100 sub-agents, reducing runtime by ~80% for complex tasks.

But Kimi 2.5 needs ~64GB+ to run locally. Most models you’d actually run on a laptop land at 63-65%. That’s a 15-point gap from Claude’s 80.8%.

In practice, this gap shows up when you ask a local model to refactor a service that touches three other services. Claude traces the dependencies, identifies the breaking changes, and suggests a migration path. A 65% model gets confused about which service owns which responsibility and proposes changes that would break the system.

Tool use that falls apart

Ollama 0.14+ supports the Anthropic Messages API, which means you can literally point Claude Code at a local model. The problem? The model behind the API matters enormously.
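
To see what that looks like in practice, here is a minimal sketch using the anthropic Python SDK against a local Ollama server. It assumes Ollama is running on its default port (11434) and exposes the Anthropic-compatible Messages endpoint described above, and that you have already pulled qwen3-coder:32b; the API key is a placeholder, since local servers typically ignore it.

```python
# Minimal sketch: talk to a local Ollama server through the Anthropic Messages API.
# Assumes Ollama >= 0.14 at localhost:11434 with its Anthropic-compatible endpoint,
# and that `ollama pull qwen3-coder:32b` has already been run.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:11434",  # local server instead of api.anthropic.com
    api_key="ollama",                   # placeholder; the local server doesn't validate it
)

response = client.messages.create(
    model="qwen3-coder:32b",            # local model name, not a Claude model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a pytest for a slugify() helper."}],
)

print(response.content[0].text)
```

Claude Code picks up the same base URL from the ANTHROPIC_BASE_URL environment variable, which is what the playbook at the end of this post relies on.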

Claude Code’s agentic loop assumes a model that can parse complex tool schemas, handle multi-turn tool chains without losing track, recover gracefully from tool errors, and know when to stop iterating. Most open models handle the first point. They struggle badly with the last three.

I’ve watched local models enter infinite loops, re-running the same failing test without changing their approach. The “muscle memory” of knowing when to re-run a test vs when to stop and ask the user is hard to train without massive agentic datasets.

Context that degrades

Models like Llama 3.3 and Qwen 2.5 claim 128K token context. In practice, for coding tasks, quality degrades noticeably past 32K tokens. “Lost in the middle” effects persist. Claude maintains coherence across 200K+ effective tokens.

When you’re feeding a local model your entire project to understand an architectural issue, it might nail the files at the beginning and end of the context window while completely missing the critical service definition at token 50K.
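
One way to check this on your own setup is a crude needle-in-the-middle probe: bury one critical definition deep inside a long synthetic context and see whether the model recalls it. The sketch below assumes a local Ollama server; the filler text, model tag, and context size are placeholders, and the token counts are rough estimates.

```python
# Crude "lost in the middle" probe against a local Ollama server.
import requests

needle = "def get_billing_owner(): return 'payments-service'  # critical definition"
filler = "# unrelated helper code\n" * 12000   # pads the prompt well past the 32K-token mark (rough estimate)
prompt = (
    filler[: len(filler) // 2]
    + needle + "\n"
    + filler[len(filler) // 2:]
    + "\nQuestion: which service does get_billing_owner() return? Answer in one line."
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3:70b",                 # placeholder; use whatever you have pulled
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 131072},          # raise the context window from Ollama's default
    },
    timeout=600,
)
print(resp.json()["response"])  # a model with real long-context recall answers 'payments-service'
```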


Performance reality: why 10 tok/s is fine—until you go agentic

Here’s what you get today on an M4 Max (128GB) with different models:

| Model | Quantization | Tokens/sec | Feels like… |
|---|---|---|---|
| Llama 3.3 70B | 4-bit (Q4_K_M) | ~11.8 | Usable. Typing speed. |
| Qwen 2.5 72B | 4-bit (MLX) | ~10.9 | Usable with MLX optimization. |
| GLM-4.7 Flash | 4-bit | ~45+ | Fast. Near-instant for short responses. |
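
If you want to reproduce these numbers on your own machine, Ollama's generate endpoint reports eval_count and eval_duration with each response, which is enough for a back-of-the-envelope tokens-per-second measurement. The model tag below is a placeholder; treat this as a rough probe, not a rigorous benchmark.

```python
# Rough tokens/sec probe using the timing fields Ollama returns with each response.
import requests

MODEL = "llama3.3:70b"  # substitute whatever model you have pulled locally

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": MODEL,
        "prompt": "Implement a thread-safe LRU cache in Python with tests.",
        "stream": False,
    },
    timeout=600,
).json()

# eval_count = output tokens generated; eval_duration = generation time in nanoseconds
tok_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{MODEL}: {tok_per_sec:.1f} tok/s over {resp['eval_count']} output tokens")
```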

For pure code generation, 10 tok/s is fine. You’re not reading faster than that. But agentic workflows are different.

When Claude Code reasons through a task, it might read 5 files (10K input tokens), generate a plan (500 output tokens at ~45 seconds), edit a file and run a test (300 tokens, ~27 seconds), read the test output and reason about the failure (400 tokens, ~36 seconds), then fix the issue and re-run (300 tokens, ~27 seconds).

That’s ~2.5 minutes for a simple edit-test-fix cycle on a 70B model locally. The same workflow on cloud Claude takes ~15-30 seconds because Anthropic’s inference runs at 50-80 tok/s on specialized hardware.

For a single cycle, 2.5 minutes is tolerable. For 10 iterations on a complex bug? You’re looking at 25 minutes of waiting. That’s where the experience breaks.
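
The arithmetic behind those numbers is easy to sanity-check. This sketch reuses the per-step token counts from the example above and assumes roughly 11 tok/s locally and 65 tok/s in the cloud (the midpoint of the 50-80 range); it ignores input-processing time, so the real local numbers skew slightly worse.

```python
# Back-of-the-envelope latency for the edit-test-fix cycle described above.
# Output-token counts per step come from the example; tok/s values are assumptions.
steps = {"plan": 500, "edit + run test": 300, "reason about failure": 400, "fix + re-run": 300}

def cycle_seconds(tok_per_sec: float) -> float:
    return sum(tokens / tok_per_sec for tokens in steps.values())

local_s = cycle_seconds(11)   # ~70B model on an M4 Max
cloud_s = cycle_seconds(65)   # midpoint of the 50-80 tok/s cloud range

print(f"local: {local_s / 60:.1f} min per cycle, ~{10 * local_s / 60:.0f} min for 10 iterations")
print(f"cloud: {cloud_s:.0f} s per cycle")
```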

What needs to happen: speculative decoding, better KV-cache management, and hardware improvements. MLX is already showing 20-40% speedups over generic llama.cpp on Apple Silicon. The target for late 2026 is 25-30 tok/s for 70B models.


The hardware escape hatch: the local inference box as the near-term win

There’s a third path that doesn’t get discussed enough: what if the model doesn’t run on your laptop, but next to it?

A dedicated inference box—a small, purpose-built device that sits on your desk and serves models over a high-speed local connection. Your laptop sends requests over Thunderbolt, 10GbE, or USB4, and gets responses back with sub-millisecond network latency.

This is already happening. An M4 Ultra Mac Studio with 192GB unified memory running Ollama, accessed by your MacBook over local network, gives you 100B+ model quality without laptop thermal constraints. Custom GPU servers with dual RTX 4090 (48GB VRAM) serve models over LAN at 30+ tok/s for $3-5K build cost.
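
From the laptop's side, the only thing that changes is the host your client talks to. A minimal sketch using the ollama Python package, assuming the box answers at a LAN address like 10.0.0.42 (a placeholder) on Ollama's default port, with an illustrative model tag:

```python
# Talk to an inference box on the local network instead of an on-laptop server.
# 10.0.0.42 is a placeholder LAN address for the box; 11434 is Ollama's default port.
from ollama import Client

box = Client(host="http://10.0.0.42:11434")

reply = box.chat(
    model="qwen2.5:72b",  # whatever the box has pulled; the tag here is illustrative
    messages=[{"role": "user", "content": "Summarize the failing test output below: ..."}],
)
print(reply["message"]["content"])
```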

The LLM box concept changes the map. It moves the hardware constraint from “commodity laptop” to “commodity appliance,” which is a much easier evolution. An inference box doesn’t need a screen, battery, keyboard, or portability. It just needs memory, compute, and cooling.

The tradeoff: You lose true offline portability (no AI on the plane), but you gain dramatically better performance for desk-based work, which is 90% of most developers’ time. If you’re already spending $500+/month on API costs, a $3-5K inference setup pays for itself in 6-10 months.

Interestingly, Anthropic is moving toward local/hybrid deployment themselves, with Claude Cowork expanding to Windows in February 2026. This validates that the industry sees the direction.


Forecast: what changes by late 2026 vs mid-2027 (and what won’t)

Late 2026: Competitive for most coding tasks

If current trends hold, open models hit 78-80% SWE-bench locally (we’re already at 76.8% with Kimi 2.5). Tool use becomes reliable for multi-step agents as multiple labs invest in agentic training. Long context reaches 128K+ usable tokens. M5 hardware or LLM boxes push 25-30 tok/s on 70B+ models.

This is the point where you might use local for 70-80% of your coding tasks and only reach for cloud Claude for the hardest problems.

Mid 2027: Near-parity for individual developers

Two components are still evolving: fine-tuning and personalization, where your local assistant learns from your codebase, and unified memory at 256GB on the M5 Ultra, which makes 100B+ models comfortable. Combined with the late-2026 improvements, this is where the dream gets real.

The hybrid reality

The future isn’t local OR cloud. It’s intelligent routing where simple tasks go local and complex reasoning goes to cloud. A refactor? Local. A novel architecture decision requiring frontier reasoning? Cloud.
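
A first cut at that routing can be embarrassingly simple. The signals and thresholds below are illustrative guesses, not tuned values, and the backends stand in for whatever Messages-API clients you have configured.

```python
# Illustrative router: send small, mechanical tasks to the local model,
# escalate large or open-ended ones to the cloud. Thresholds are guesses.
def pick_backend(prompt: str, context_tokens: int, novel_design: bool) -> str:
    if novel_design or context_tokens > 32_000:   # past where local quality degrades today
        return "cloud"
    mechanical = any(k in prompt.lower() for k in ("rename", "refactor", "add test", "fix lint"))
    return "local" if mechanical else "cloud"

# Usage: route first, then call the matching client (both speak the same Messages format).
backend = pick_backend("Refactor UserService to use the new logger",
                       context_tokens=12_000, novel_design=False)
print(backend)  # -> "local"
```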

For enterprises, the calculus is different. Banks and healthcare companies may prefer local LLMs for compliance and data sovereignty reasons even if quality is 10-15% lower. The tradeoff between capability and control looks different when you’re handling regulated data.

The honest caveat

Frontier labs aren’t standing still. Every time open models close a gap, the frontier moves. The question isn’t whether local models will match today’s Claude—they will. The question is whether they’ll match future Claude. The strategic bet is that for most coding workflows, “good enough” arrives before “perfect” matters.


Practical playbook: how to prepare without betting the farm

  1. Try it now. Install Ollama, pull qwen3-coder:32b, and point Claude Code at it via ANTHROPIC_BASE_URL. Feel the gap firsthand.
  2. Watch the benchmarks. Track SWE-bench for open models. When a 30-70B model hits 75%+ on SWE-bench, the math changes.
  3. Budget for hardware. If you’re buying a new laptop, get the 128GB configuration. Future-proof for local LLMs.
  4. Consider the LLM box. If you’re spending $500+/month on API costs, a $3-5K inference setup pays for itself in under a year.
  5. Design for swap-ability. Use the Anthropic Messages API format in your tooling. Swap between local and cloud with an environment variable, not a rewrite (see the sketch after this list).
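
A minimal sketch of that swap, assuming the anthropic Python SDK and a local server that speaks the Anthropic Messages format (as with Ollama 0.14+ above). Only the environment changes between local and cloud; CODING_MODEL is a hypothetical variable for picking the model name.

```python
# One code path for local and cloud: the environment decides which backend answers.
#   ANTHROPIC_BASE_URL=http://localhost:11434  -> local Ollama
#   ANTHROPIC_BASE_URL unset                   -> api.anthropic.com
import os
import anthropic

def make_client() -> anthropic.Anthropic:
    base_url = os.environ.get("ANTHROPIC_BASE_URL")           # None means the real API
    api_key = os.environ.get("ANTHROPIC_API_KEY", "ollama")   # local servers ignore the key
    return anthropic.Anthropic(base_url=base_url, api_key=api_key)

client = make_client()
model = os.environ.get("CODING_MODEL", "qwen3-coder:32b")  # hypothetical env var for the model name
resp = client.messages.create(
    model=model,
    max_tokens=512,
    messages=[{"role": "user", "content": "List the files this change should touch."}],
)
print(resp.content[0].text)
```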

The components on the left of the Wardley Map need to move right. When they do, you’ll know it’s time to shift.


Dan Gurgui | A4G
AI Architect

Weekly Architecture Insights: architectureforgrowth.com/newsletter