Local & Open Models · Intermediate

Local coding on an M4 Mac: Ollama + Devstral

When a local model earns its slot. The full setup on Apple Silicon — Ollama, Devstral Small 24B, Metal acceleration, and the editor integrations that make local coding stop feeling like a demo.

February 4, 2026 · 12 min read · Last verified April 17, 2026

Let me get the honest framing out of the way first, because every “run your own coding model” post I’ve read skips it and the result is you install something, try it for fifteen minutes, notice it’s not Claude, and uninstall.

Local coding on a Mac, in early 2026, is not a Claude replacement. It is not even close. A 24B-parameter model running on your laptop cannot out-reason a frontier API model on real codebases, cannot plan multi-file refactors as well, and cannot hold a candle to Claude Code or a GPT-class agent when the task needs tool use at any depth. If you set up local hoping for that, you will be disappointed.

This primer is not about that.

This primer is about the narrow and real set of workloads where a locally-run coding model wins — where it’s measurably better than reaching for the API, not worse — and the exact setup to get there on an M4-class Mac. When you know what local is for, you stop comparing it to Claude on refactor tasks it was never going to win, and you start using it for the things it quietly dominates.

When local actually wins

Four cases. These are the ones I keep coming back to:

1. Privacy-hard workloads. Health data. Legal discovery. NDA’d client code. Internal repos where “do not send to a third party” is in the contract or the compliance doc. In these cases it doesn’t matter how much better Claude would be — you can’t use Claude. Local isn’t a preference; it’s the only option. A 24B model you control end-to-end is the ballgame.

2. Offline. Flights. Trains through tunnels. A hotel in a part of the world where the WiFi is, charitably, aspirational. I have shipped real code on a local model at 35,000 feet more than once in the last year. That’s not an edge case when you travel.

3. Tight loops where latency dominates quality. Inline completion as you type. Docstring generation. “Write me a regex that matches this pattern.” Small, scoped, fast-turnaround tasks where the incremental quality of a frontier model matters less than the sub-100-millisecond time to first token. API calls have a floor latency — DNS, TLS, routing — that a local model eliminates. For completions that fire ten times per minute, that floor is the whole experience.

4. Cost pressure on lots of small queries. Not the one big reasoning task — API Claude wins that every time on a cost-per-outcome basis. I mean the thousand-query workload: scanning a codebase to generate summaries, running a linter-adjacent check across every file, batch-classifying a corpus. Once the hardware is paid for, the marginal cost of a local query is electricity and heat. If you’re running enough of these, local pays for itself fast.

Notice what’s missing from the list: “deep reasoning over unfamiliar code,” “multi-file agentic refactor,” “anything with a long tool-use loop.” Those aren’t local-wins in early 2026. Don’t force them.

What M4 machines actually deliver

Apple Silicon’s unified memory is the whole story for why you can run 24B-class models on a laptop at all. The GPU and CPU share the same memory pool, so a model’s weights don’t need to be copied across PCIe to a discrete GPU — they just live in RAM and the GPU (via Metal) reads them directly.

The practical thresholds, as of early 2026:

| Spec | Verdict |
| --- | --- |
| M4 base, 16 GB | Not enough. Skip. |
| M4 Pro, 24 GB | Works for 7–8B models at Q4. Tight for 24B. |
| M4 Pro, 36 GB | Sweet spot. Devstral 24B Q4 runs cleanly. |
| M4 Max, 48–64 GB | Comfortable. Room for 32B models and longer context. |
| M4 Ultra, 64+ GB | Overkill for coding; relevant if you want 70B+ general models. |

Below 24 GB, don’t bother with coding-grade models — you’ll be swapping, thrashing, or stuck on 7B models that are genuinely worse than using the API. The 36 GB M4 Pro is the floor I’d actually recommend if you’re buying specifically to do this.
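One way to sanity-check those thresholds is a back-of-envelope budget: total RAM, minus what macOS and your other apps hold, has to cover the model weights plus some room for context. A minimal sketch, assuming roughly 8 GB reserved for the system and 2 GB for context (both are illustrative assumptions, not measurements) and the roughly 14 GB Q4 weight size for Devstral:

```python
def fits_in_unified_memory(ram_gb, weights_gb, context_gb=2.0, reserved_gb=8.0):
    """Rough check: does a model fit without pushing the machine into swap?

    reserved_gb (macOS plus everyday apps) and context_gb (KV cache and
    runtime buffers) are illustrative assumptions; tune them for your setup.
    Returns (fits, headroom_gb).
    """
    available = ram_gb - reserved_gb      # memory the model can realistically use
    needed = weights_gb + context_gb      # weights plus working memory
    headroom = available - needed
    return headroom >= 0, headroom


# Devstral at Q4 is about 14 GB of weights.
for ram in (16, 24, 36, 48):
    fits, headroom = fits_in_unified_memory(ram, weights_gb=14)
    print(f"{ram} GB: fits={fits}, headroom={headroom:+.1f} GB")
```

Under these assumptions 16 GB comes up short, 24 GB lands at zero headroom (the "tight" row), and 36 GB leaves room to keep an editor and a browser open.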

Metal acceleration is table stakes: every tool worth using enables it by default on macOS. You’ll see references to “GGUF” and “llama.cpp” under the hood. GGUF is the model file format and llama.cpp is the inference engine that runs it on Metal; Ollama builds on both.

Which model to run

As of February 2026, Devstral Small 24B from Mistral is the right default for coding on this class of machine. Released in May 2025, it was specifically trained for agentic coding workflows — Mistral optimized it against SWE-bench rather than just general benchmarks, which matters because SWE-bench measures the “open a real repo, fix a real bug” skill that you actually care about.

Two things make Devstral the default instead of the alternatives:

  1. It’s small enough to fit comfortably in 36 GB at Q4, with headroom for context. A 32B model leaves you fighting for memory.
  2. It was trained for tool-use-style coding tasks, not just raw code completion. That’s a better match for how you’ll actually use it in an editor.

The alternatives worth knowing:

| Model | Size | Strength | Trade-off |
| --- | --- | --- | --- |
| Devstral Small | 24B | Agentic coding, SWE-bench focus | Default pick |
| Qwen 2.5-Coder | 32B | Strongest open coding model | Needs more memory; slower tokens/sec |
| Qwen 3 | 8B | Very fast, great for inline completion | Ceiling is lower on complex tasks |
| DeepSeek-Coder V2 | 16B MoE | Strong, efficient | Slightly older; less agentic training |

If you have 48 GB+ and want maximum quality, Qwen 2.5-Coder-32B is worth trying — it’ll be slower but the output on harder tasks is noticeably better. If you want fastest-possible inline completion and are okay with a lower ceiling, Qwen 3 8B is the one.

For most people with a 36 GB M4 Pro: pull Devstral and move on.

Install

Ollama is the right runtime. It wraps llama.cpp, handles Metal acceleration out of the box, and gives you a clean HTTP API on port 11434 that every editor integration worth using speaks to.

# install
brew install ollama

# or, if you prefer: download the .dmg from ollama.com

# start the server (runs in background)
ollama serve &

# pull the model — about 14 GB at Q4_K_M
ollama pull devstral

The first pull takes a few minutes depending on your connection. Once it’s done, the model lives at ~/.ollama/models/ and loads into memory on first use.
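To confirm the HTTP API is answering before any editor gets involved, a small Python sketch against the `/api/generate` endpoint can help. It uses only the standard library. The helper names `build_generate_request` and `generate` are chosen here for illustration, and the sketch assumes the server is on the default port with `devstral` already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address


def build_generate_request(prompt, model="devstral", base_url=OLLAMA_URL):
    """Build a non-streaming POST to /api/generate without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="devstral", timeout=120):
    """Send the prompt and return the generated text.

    Needs a running `ollama serve`; raises URLError if nothing is listening.
    """
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server up, `print(generate("Write a regex that matches ISO 8601 dates."))` should print a completion. The first call includes model load time, so expect it to be slower than the ones that follow.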

First test

Before wiring up any editor, sanity-check the model in the CLI:

ollama run devstral

This drops you into an interactive REPL. Try a real prompt — not “hello world,” but something representative of what you’d actually send:

>>> Write a Python function that takes a list of file paths and
    returns a dict grouping them by their top-level directory.
    Handle the edge case of files at the repo root.

You’re looking for three things: Does it produce working code? Does it handle the edge case you flagged? How fast does it feel? If the answer to the first two is yes and the speed feels usable, you’re good. If it’s glacial or the code is off, check your quantization (below) and memory pressure.
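For the prompt above, it helps to have a known-good answer to compare against. Here is one hedged reference version. The function name, the `"."` key for root-level files, and the assumption that inputs are relative POSIX-style paths are all choices made for this sketch; a model's answer can reasonably differ on them and still be correct.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_top_level(paths, root_key="."):
    """Group relative file paths by their first directory component.

    Files that sit directly at the repo root have no directory component,
    so they are collected under root_key (an arbitrary choice for this sketch).
    """
    groups = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        # A single-component path is a file at the repo root.
        key = parts[0] if len(parts) > 1 else root_key
        groups[key].append(path)
    return dict(groups)


example = ["src/app.py", "README.md", "src/utils/io.py", "docs/index.md"]
print(group_by_top_level(example))
```

If the model's version drops `README.md` or files it under a key named `README.md`, it missed the edge case you flagged.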

Type /bye to exit the REPL. The model stays loaded in memory for a while — the next query will be faster.

Quantization

Ollama defaults to Q4_K_M quantization, which is the right call for a 24B model on a 36 GB machine. The short version:

  • Q4_K_M — 4-bit quantization with some bits preserved at higher precision. Sensible default. About 14 GB on disk for Devstral.
  • Q8_0 — 8-bit. Higher quality on edge cases, roughly 2x the memory footprint. Worth it only if you have 64 GB+ to spare and you’ve noticed Q4 quality issues on your actual workload.
  • Q2 — don’t bother. The quality drop is too steep for coding.
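Those sizes follow from simple arithmetic: parameter count times bits per weight, divided by eight. A rough sketch, where the bits-per-weight figures are approximate averages for llama.cpp-style quantizations (Q4_K_M mixes precisions, so its effective rate varies a little by model) and 24 is the nominal parameter count in billions:

```python
# Approximate effective bits per weight. Treat these as ballpark values:
# Q4_K_M keeps some tensors at higher precision, and Q8_0 stores a
# per-block scale on top of the 8-bit weights.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def weights_gb(params_billion, quant):
    """Estimated size of the weights in GB (decimal) for a quantization."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(f"{quant}: ~{weights_gb(24, quant):.1f} GB")
```

The estimate lands near 14.5 GB for Q4_K_M and 25.5 GB for Q8_0, which is where the "about 14 GB" and "roughly 2x" figures come from. It covers weights only; context and runtime buffers come on top.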

To explicitly pick a quantization:

ollama pull devstral:24b-small-2505-q8_0
# or whichever quantized tag the model's page on ollama.com lists;
# `ollama show devstral` confirms what you actually pulled

If you’re not sure whether your quantization is hurting you, a reasonable test is to run ten real prompts at Q4 and ten at Q8, and see if you can tell the difference blind. Usually you can’t.
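The blind comparison is easy to get wrong if you know which output came from which model. A minimal harness sketch, assuming you have already collected paired outputs for the same prompts; `blind_ab_trial` and the `judge` callback are hypothetical names, and in practice the judge is you, shown two unlabeled answers:

```python
import random


def blind_ab_trial(pairs, judge, seed=None):
    """Run a blind preference test over (q4_output, q8_output) pairs.

    judge(first, second) sees the two outputs in random order and returns
    0 or 1 for the one it prefers. Returns how often each side won.
    """
    rng = random.Random(seed)
    wins = {"q4": 0, "q8": 0}
    for q4_out, q8_out in pairs:
        labeled = [("q4", q4_out), ("q8", q8_out)]
        rng.shuffle(labeled)  # hide which side is which
        choice = judge(labeled[0][1], labeled[1][1])
        wins[labeled[choice][0]] += 1
    return wins
```

With ten pairs, a split near 5 to 5 suggests you cannot tell the quantizations apart on your workload. Ten prompts is a small sample, so treat anything short of a lopsided result as "no clear difference."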

Editor integrations

Three paths. Pick one — running multiple at once just means multiple processes asking the same Ollama instance for tokens.

Zed

Zed has a native ollama provider in its Assistant panel. Open settings (cmd-,), add:

{
  "assistant": {
    "version": "2",
    "default_model": {
      "provider": "ollama",
      "model": "devstral:latest"
    }
  },
  "language_models": {
    "ollama": {
      "api_url": "http://localhost:11434"
    }
  }
}

Restart Zed. The model appears in the Assistant panel dropdown. This is the cleanest setup — no plugins, no JSON config for tools, everything just works.

The trade-off: Zed’s Assistant panel is less featureful than Continue.dev’s VS Code flow. If you live in Zed, this is perfect. If you live in VS Code, keep reading.

Continue.dev in VS Code

Continue is the most flexible path — it supports inline completion, chat, and agentic edits, and points at Ollama via a JSON config.

Install the Continue extension in VS Code, then edit ~/.continue/config.json:

{
  "models": [
    {
      "title": "Devstral (local)",
      "provider": "ollama",
      "model": "devstral",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Devstral autocomplete",
    "provider": "ollama",
    "model": "devstral",
    "apiBase": "http://localhost:11434"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}

That config gives you chat and tab-autocomplete from the same model. If you want a smaller/faster model just for inline completion and the bigger one for chat, split the two — tabAutocompleteModel pointing at Qwen 3 8B, and the main models entry pointing at Devstral:

{
  "models": [
    { "title": "Devstral", "provider": "ollama", "model": "devstral",
      "apiBase": "http://localhost:11434" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen fast",
    "provider": "ollama",
    "model": "qwen3:8b",
    "apiBase": "http://localhost:11434"
  }
}

That split is actually the right setup if you have the memory for both models loaded. Fast completion on every keystroke, heavier model when you explicitly ask for help.

aider

aider is the right choice when you want task-level runs — “here’s a bug, fix it across these three files” — rather than inline completion. It’s a terminal-first tool that speaks to any model via a CLI flag:

aider --model ollama/devstral

That’s the whole setup. aider will use the Ollama HTTP API directly. Its strength is that it applies edits to files on disk transactionally, shows you a diff, and commits to git if you let it. That’s a much better fit for “tackle a scoped refactor” than inline completion is.

The mental model I use: Continue.dev for as-you-type help, aider for one-shot tasks. They don’t conflict; they’re different shapes of the same underlying model.

What local coding actually feels like

Three honest observations from running this daily:

Inline completion is great. Sub-100ms time-to-first-token, on an M4 Pro, for a completion that’s usually short. Indistinguishable from Copilot in feel. Often better, because there’s no round trip. This is where local earns its slot.

Chat-style Q&A is… fine. Asking “what does this function do” or “write me a test for this” produces answers that are 80–90% as good as a frontier model, in a few seconds. For routine stuff it’s totally serviceable.

Agentic tasks are meaningfully slower than API-backed Claude. A task that Claude Code finishes in 30 seconds might take 3–5 minutes on Devstral locally. Not because the model is slow per-token (it isn’t), but because the task loop (plan, edit, re-read, edit again) compounds the latency of every step. Don’t benchmark local on a 40-minute refactor. You’ll hate it.

The better way to benchmark local: do 30 small completions in a row. Count how many were useful, how many were instant, and how many times you noticed the absence of a network call. That’s the workload it was built for.
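If you would rather measure "instant" than eyeball it, time-to-first-token is the number to record. A small sketch that works on any lazy iterator of text chunks, for example the pieces of a streamed Ollama response. The function names are illustrative, and the 100 ms cutoff mirrors the figure used in this primer rather than any standard:

```python
import time


def time_to_first_token(chunks, clock=time.perf_counter):
    """Consume a lazy stream of text chunks.

    Returns (seconds until the first chunk arrived, full text). The stream
    should be lazy, so that the request starts when iteration starts.
    """
    start = clock()
    first = None
    pieces = []
    for chunk in chunks:
        if first is None:
            first = clock() - start  # latency of the very first chunk
        pieces.append(chunk)
    return first, "".join(pieces)


def share_under(latencies, limit=0.1):
    """Fraction of measured latencies below limit seconds (None = no output)."""
    measured = [t for t in latencies if t is not None]
    if not measured:
        return 0.0
    return sum(t < limit for t in measured) / len(measured)
```

Run it over your 30 completions, collect the first value from each call, and `share_under` tells you what fraction actually felt instant.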

Cost math

“Local is free” is a marketing line. It isn’t quite free. The honest accounting:

  • Electricity. An M4 Pro pegged on a coding task draws maybe 30–50W under load. Over a day of heavy use, pennies.
  • Battery. Your laptop battery empties noticeably faster. Plug in when you’re hammering the model.
  • Heat. The fans will spin up. It’s a laptop, not a server.
  • Hardware. Amortized over two to three years of use, the delta between a 36 GB M4 Pro and a 16 GB M4 Pro is a few hundred dollars. Call it $10–15 per month in extra laptop cost.

Against that: once the hardware is paid for, the marginal cost of a query is close to zero (just the electricity above). If you’re running thousands of completions per day, that matters. The exact break-even depends on your monthly API spend and is fuzzy, but if you’re spending over $50/month on API calls that are mostly completions and small tasks, local probably pays back within the first year of laptop ownership.
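The payback claim is easy to check with your own numbers. A sketch of the arithmetic, where every input is an assumption to replace: $400 as the "few hundred dollars" hardware delta, 40 W for eight hours a day, $0.15 per kWh, and the simplification that local replaces the whole $50/month of API spend:

```python
def monthly_electricity_cost(watts, hours_per_day, price_per_kwh, days=30):
    """Electricity cost per month for a given average draw."""
    return watts / 1000 * hours_per_day * days * price_per_kwh


def months_to_break_even(hardware_delta, monthly_api_spend, monthly_power_cost):
    """Months until the extra hardware cost is recovered, or None if never."""
    saving = monthly_api_spend - monthly_power_cost
    if saving <= 0:
        return None
    return hardware_delta / saving


power = monthly_electricity_cost(watts=40, hours_per_day=8, price_per_kwh=0.15)
months = months_to_break_even(400, monthly_api_spend=50, monthly_power_cost=power)
print(f"power: ${power:.2f}/month, break-even: {months:.1f} months")
```

With those inputs, electricity comes to about $1.44 a month and the hardware delta pays back in a little over eight months. If local replaces only half of the API spend, the payback period roughly doubles.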

The real question isn’t cost. It’s whether the workload is a match.

What local still can’t do well in early 2026

Be honest about the ceiling:

  • Deep reasoning over large codebases. Devstral has a smaller context window and a smaller model ceiling than a frontier API model. When the task needs a mental model of a 50-file module, local falls down.
  • Long-horizon agent loops. Anything with 20+ tool calls compounds error rates faster than a frontier model does. Local is rarely the right agent backbone.
  • Strong tool use. The model can use tools, but it’s less reliable about when to use them and how to recover when a tool call fails. Frontier models are much better at this.

For any of those, stop fighting and use API Claude or GPT. That’s not a failure of the local setup. That’s just the current state of what 24B parameters can do.

The hybrid pattern

This is the setup I actually run, every day:

  • Ollama + Devstral handles inline completion in the editor. Bound to a keyboard shortcut, always on, always instant.
  • Claude Code in a terminal handles agentic tasks. When I want to plan and execute a real change across files, I reach for the terminal, not the editor’s inline assistant.

One keyboard shortcut vs a different one. Two different mental modes: “help me type faster” vs “go do this task.” The hybrid isn’t theoretical — you already make this distinction mentally when you decide whether to accept a completion or open a chat. Making it explicit in your setup removes the friction of picking the wrong tool.

The split I aim for:

  • Completion, docstring, small regex, “finish this line for me” → local
  • “Refactor this module,” “debug this test failure,” “add a new feature end-to-end” → API
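Written down as a rule, the split looks something like the sketch below. The task labels and the `route` function are hypothetical, made up for illustration. The `private` flag encodes the privacy-hard case from earlier, where local is the only option no matter how hard the task is, and sending unknown task kinds to the API is a default chosen here, not a rule:

```python
LOCAL_TASKS = {"completion", "docstring", "regex", "finish_line"}
API_TASKS = {"refactor_module", "debug_test_failure", "new_feature"}


def route(task_kind, private=False):
    """Pick a backend for a task: "local" or "api"."""
    if private:
        return "local"  # code that cannot leave the machine stays local
    if task_kind in LOCAL_TASKS:
        return "local"
    # Listed API tasks and anything unrecognized go to the frontier model.
    return "api"


print(route("docstring"))                      # local
print(route("refactor_module"))                # api
print(route("refactor_module", private=True))  # local
```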

Once you internalize that split, local stops feeling like a diminished version of Claude and starts feeling like a different tool that happens to live on the same keyboard.

Pitfalls

Running out of memory. If you’ve got other apps open and you load a 24B model, you can push into swap. macOS won’t crash, but everything slows down and the model generation gets erratic. Close Slack and Chrome’s 40 tabs before loading the model, or spring for 48 GB.

Background ollama serve hogging RAM. Ollama keeps recently used models warm in memory for a while after you stop querying them. Convenient, but it means a 14 GB chunk of your RAM can be held even when you’re not actively using it. ollama stop devstral unloads it. Or set OLLAMA_KEEP_ALIVE=0 in the environment that ollama serve starts from (the server reads it, not the client) to unload immediately after each request.
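The same unload can be requested over the HTTP API by sending a request with `keep_alive` set to 0 and no prompt. A hedged sketch using the standard library; `build_unload_request` is a name invented here, and the default address is assumed:

```python
import json
import urllib.request


def build_unload_request(model="devstral", base_url="http://localhost:11434"):
    """Build a request asking Ollama to unload a model right away.

    A generate call with keep_alive=0 and no prompt tells the server to
    drop the model from memory instead of keeping it warm.
    """
    payload = {"model": model, "keep_alive": 0}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is one line, `urllib.request.urlopen(build_unload_request())`, which makes it easy to bind to a script you run before opening something memory-hungry.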

Wrong quantization. If you accidentally pull a Q2 or Q3 variant, quality drops sharply. Always check ollama show <model> to confirm what you’ve got, and stick with Q4_K_M or Q8 for coding.
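That check can be scripted. The sketch below inspects the parsed JSON from Ollama's `/api/show` endpoint, assuming the response carries a `details.quantization_level` string such as `Q4_K_M`; that response shape is an assumption worth verifying against your Ollama version. `quantization_ok` is a hypothetical helper:

```python
import re


def quantization_ok(show_response, min_bits=4):
    """True if the reported quantization is at least min_bits wide.

    Accepts names like Q4_K_M, Q8_0, or IQ3_XS, plus unquantized
    F16/BF16/F32. Anything unrecognized counts as not OK.
    """
    level = show_response.get("details", {}).get("quantization_level", "").upper()
    match = re.match(r"I?Q(\d+)", level)
    if match:
        return int(match.group(1)) >= min_bits
    return level in {"F16", "BF16", "F32"}


print(quantization_ok({"details": {"quantization_level": "Q4_K_M"}}))  # True
print(quantization_ok({"details": {"quantization_level": "Q2_K"}}))    # False
```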

Embedding Ollama into latency-sensitive prod paths. Don’t. Ollama is for your laptop. For production inference, use a real inference server (vLLM, TGI, or a cloud inference provider). Ollama’s process model and memory management are built for single-user developer use, not for serving traffic.

Assuming parity with Claude. This is the biggest one and it’s the reason most people bounce off local. If you benchmark Devstral against Claude on a hard task, Devstral loses. That doesn’t mean Devstral is bad — it means you picked the wrong task to benchmark on. Pick a completion task and try again.

Getting started — the 15-minute path

If you skip everything above, do this:

  1. brew install ollama && ollama serve &
  2. ollama pull devstral
  3. ollama run devstral — verify it works with a real prompt.
  4. Install Continue.dev in VS Code (or the Ollama provider in Zed) and point it at http://localhost:11434.
  5. Take one prompt you’d normally send to Claude — a small one, a completion or a docstring or a regex — and send it to Devstral instead. Compare.

The point of step 5 is not to decide whether local is “as good as Claude.” It’s to feel the difference — the instant response, the no-network, the “I could do this on a plane” — and calibrate which of your prompts actually want that shape.

Once you’ve done that, local has a slot in your workflow. A narrow slot, but a real one. That’s the win.