The blocking problem in AI-assisted development is underrated. You're mid-thought, you ask the assistant to generate a test suite or rewrite a module, and then you sit there watching tokens arrive for two minutes. When it finishes, your mental context has half-evaporated. You've lost more than time — you've lost the thread.

The instinctive fix is to open another tab, start another session, and do something else while the first task runs. That works, but it's friction. You're now context-switching between sessions manually, tracking state in your head, and merging outputs yourself. The assistant is doing asynchronous work through a synchronous interface.

The pattern I landed on is simpler: give the AI a task queue as an MCP tool. It submits work, gets a task ID back immediately, carries on with whatever it was doing. A local worker pool drains the queue in the background and writes results directly to the workspace. No polling, no manual merging — the files just appear.

The queue model

The core idea is a SQLite-backed priority queue exposed as an MCP server. The AI interacts with it through typed tool calls — submit_task, submit_batch, get_task, list_tasks, retry_failed — over the standard MCP stdio transport. From the AI's perspective, it's just tools. The queue mechanics are entirely internal.
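
To make "it's just tools" concrete, here is a toy stand-in for the dispatch an MCP server performs, using the stdlib only. The real server speaks MCP stdio through an SDK; the signatures and return shapes below are my assumptions, not the actual schema.

```python
# Toy stand-in for the MCP tool surface (stdlib only). Tool names come from the
# text; signatures and return shapes are assumptions, not the real schema.
import uuid

def submit_task(task_type: str, spec: str, priority: int = 5) -> dict:
    # The real server inserts a pending row into SQLite and acks immediately.
    return {"task_id": str(uuid.uuid4()), "status": "pending"}

TOOLS = {
    "submit_task": submit_task,
    # get_task, list_tasks, retry_failed register the same way
}

def handle_tool_call(name: str, arguments: dict) -> dict:
    # An MCP tool call reduces to: look up the handler, invoke it, return JSON.
    return TOOLS[name](**arguments)
```

The point of the shape is that the queue is invisible behind the tool boundary: the caller never sees SQLite, only an ID and a status.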

System overview: AI session submits tasks via MCP stdio to a worker server, which enqueues to SQLite, a worker pool claims and executes tasks through LLM call / shell / agent paths, writing results to workspace. HTTP API sidecar reads same database for browser dashboard.
System overview. The HTTP API sidecar is completely separate from the MCP server — it reads the same SQLite database but never writes to it during normal operation.

Tasks have a type — write_file, code_task, shell, test_gen, refactor, agent, and a handful of others. The type determines which executor handles the task: an LLM call that parses structured output and writes files, a validated shell command, or a full multi-turn agent loop. Tasks also carry a priority (1–10), an optional output path, context files to inject, and a large_context flag that overrides model selection to one with a 1M-token window.
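
Concretely, a task record along those lines might look like this. Field names mirror the text; the defaults and the exact schema are assumptions on my part.

```python
# Sketch of a task record; field names mirror the text, defaults are assumptions.
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    task_type: str                     # write_file, code_task, shell, test_gen, refactor, agent, ...
    spec: str                          # natural-language description of the work
    priority: int = 5                  # 1..10
    output_path: Optional[str] = None  # where the executor writes its result
    context_files: list[str] = field(default_factory=list)  # injected into the prompt
    large_context: bool = False        # force a 1M-token-window model
    status: str = "pending"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

task = Task(task_type="test_gen", spec="pytest suite for tokens.py", priority=3)
```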

Submitting a batch looks like this from the AI's side:

# Submit a batch from within an AI coding session
result = await submit_batch([
    {
        "task_type": "test_gen",
        "spec": "Write pytest suite for src/auth/tokens.py covering all public functions",
        "output_path": "tests/test_tokens.py",
        "context_files": ["src/auth/tokens.py"],
        "priority": 3
    },
    {
        "task_type": "write_file",
        "spec": "Write a CHANGELOG entry for the auth refactor in conventional commits format",
        "output_path": "CHANGELOG.md",
        "priority": 5
    }
])
# Returns ["task-uuid-1", "task-uuid-2"] immediately
# Worker pool picks them up and runs them in the background

The return is two task IDs. The AI session is free to continue. The worker pool claims tasks atomically — a single UPDATE ... WHERE status='pending' RETURNING * with no external lock — and executes them concurrently up to the configured concurrency limit.
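
The atomic claim can be sketched directly against SQLite. This is a minimal illustration, assuming a lower priority number means more urgent; the RETURNING clause needs SQLite 3.35 or later.

```python
import sqlite3

def claim_next_task(conn: sqlite3.Connection):
    # One UPDATE is the whole lock: whichever worker's statement commits first
    # flips the row to 'running'; every other worker sees no pending row left.
    return conn.execute(
        """
        UPDATE tasks SET status = 'running'
        WHERE id = (
            SELECT id FROM tasks WHERE status = 'pending'
            ORDER BY priority ASC, created_at ASC
            LIMIT 1
        )
        RETURNING id, status, priority
        """
    ).fetchone()

# Demo with an in-memory database and two pending tasks
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT, status TEXT, priority INT, created_at INT)")
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?)",
                 [("a", "pending", 5, 1), ("b", "pending", 3, 2)])
claimed = claim_next_task(conn)
```

Task "b" is claimed first despite arriving later, because the subquery orders by priority before arrival time.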

The state machine

Every task moves through a small set of states. The transitions are atomic SQLite writes, which means there's no race between concurrent workers claiming the same task even without a message broker.

Queue state machine: pending → running → completed or failed. Cancelled branch from pending. Retry arc from failed back to pending. Purge terminal from completed and cancelled.
State transitions. Failed tasks don't auto-retry — you inspect them first, then decide. This is intentional: automatic retry on LLM failures tends to spend money on the same broken prompt repeatedly.

Failed tasks stay visible until you act on them. You can retry a single task, retry all failed tasks at once, or leave them alone and read the error message to understand what went wrong. The queue never silently discards a failure.
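
The transition guard implied by the diagram fits in a few lines; the state names come from the text, but the table itself is my reading of it, not the project's code.

```python
# Sketch of the transition guard implied by the state diagram.
VALID_TRANSITIONS = {
    "pending":   {"running", "cancelled"},
    "running":   {"completed", "failed"},
    "failed":    {"pending"},   # explicit retry only, never automatic
    "completed": set(),         # terminal until purged
    "cancelled": set(),         # terminal until purged
}

def can_transition(current: str, target: str) -> bool:
    return target in VALID_TRANSITIONS.get(current, set())
```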

Completed tasks accumulate a cost record: input tokens, output tokens, and cost in USD calculated from provider pricing tables or taken directly from the API response. The aggregate is available from the stats endpoint. Useful for getting an honest picture of what a sprint of background generation actually cost.
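
The arithmetic is straightforward once the pricing table is in hand. The rates below are placeholders for illustration, not real provider pricing.

```python
# Illustrative pricing table: USD per million tokens as (input, output).
# These numbers are placeholders, not real provider rates.
PRICING_PER_MTOK = {
    "example-model": (3.00, 15.00),
}

def task_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    rate_in, rate_out = PRICING_PER_MTOK[model]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out
```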

Model routing

Each task can specify a model explicitly or use "auto", which routes to a primary provider with fallback logic. In practice: try AWS Bedrock first (cheaper for sustained volume, stays within a private endpoint if desired), fall back to OpenAI if credentials are absent or the call fails. Named aliases like "gpt-4o" or "o3-mini" bypass the routing logic and hit OpenAI directly.

Model routing diagram: task model field routes through resolve_model function, auto routes to Bedrock with OpenAI fallback, named OpenAI aliases bypass to OpenAI directly, large_context flag overrides to 1M context model.
Routing logic. The large_context flag is the only override that bypasses the model field — it's designed for tasks where context window size matters more than model preference.

The large_context flag is worth calling out. Some tasks — summarising a large codebase, writing an article from a long research doc, processing a full test suite for review — need a model with a wide context window more than they need a specific model family. Setting the flag forces the worker to route to a model with a 1M-token context window regardless of what the model field says. The task submitter doesn't need to know which specific model that is today; the queue configuration tracks it.
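
Put together, the routing is a short cascade. The model identifiers and the credential check below are placeholders standing in for whatever the queue configuration actually holds.

```python
import os

# Placeholder identifiers; the queue config tracks the real model names.
LARGE_CONTEXT_MODEL = "bedrock:1m-context-model"
OPENAI_ALIASES = {"gpt-4o", "o3-mini"}

def resolve_model(model: str, large_context: bool) -> str:
    if large_context:
        return LARGE_CONTEXT_MODEL   # the only override that ignores the model field
    if model in OPENAI_ALIASES:
        return f"openai:{model}"     # named aliases bypass routing entirely
    if model == "auto":
        if os.environ.get("AWS_ACCESS_KEY_ID"):
            return "bedrock:primary"  # cheaper sustained volume; caller falls back on error
        return "openai:fallback"
    return model                      # anything else passes through unchanged
```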

Steering injection

Background tasks run without the project context that's loaded in the interactive session. Steering injection solves this: configure a path to a project rules file (a .clinerules, CLAUDE.md, or equivalent) and the worker prepends its contents to every task's system prompt automatically. Coding standards, naming conventions, architecture constraints — they apply to background work without the task submitter having to include them in every spec.

This matters more than it sounds. Without it, background tasks produce code that's technically correct but doesn't match the project — wrong import style, wrong error handling pattern, wrong test structure. You end up doing cleanup passes that cost as much time as the generation saved. Steering injection is cheap to set up and pays for itself quickly.

Agent tasks

Most task types are a single LLM call: send a prompt, receive text, parse it, write a file. Agent tasks are different. They run a multi-turn ReAct loop using the Claude CLI as a subprocess, with a sandboxed set of tools: read files, write to a workspace sandbox directory, submit further tasks to the queue. The loop runs for up to twenty turns before timing out.

# Worker loop for agent tasks — simplified
async def execute_agent(task: Task, config: Config) -> ExecutionResult:
    from claude_agent_sdk import query, ClaudeAgentOptions, ResultMessage

    options = ClaudeAgentOptions(
        max_turns=20,
        permission_mode="bypassPermissions",  # sandboxed worker, not user-facing
        system_prompt=build_system_prompt(task, config),
    )

    # Keep the final ResultMessage; intermediate messages are turn-by-turn chatter
    result = None
    async for message in query(prompt=task.spec, options=options):
        if isinstance(message, ResultMessage):
            result = message

    usage = result.usage or {}
    return ExecutionResult(
        output=result.result or "",
        cost_usd=result.total_cost_usd,
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
    )

Agent tasks are useful for work that requires iteration: write a file, run a linter, fix what the linter found, repeat. A single LLM call can't do that; an agent loop can. The tradeoff is cost — a twenty-turn agent run costs significantly more than a single call — so they're worth reserving for tasks that genuinely need the iteration.

The HTTP sidecar

The MCP server communicates over stdio. It's invisible to a browser. If you want to see what's in the queue — what's running, what failed, what it's cost today — you need something that speaks HTTP.

The solution is a thin FastAPI app that reads the same SQLite database. It runs as a persistent background service on a fixed port, managed by launchd (macOS) or systemd (Linux). It's read-only for most endpoints; cancel, retry, and purge are the only write operations, and they only operate on tasks in states where mutation makes sense.

# api.py — the whole thing is ~150 lines
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["*"])

@app.get("/api/stats")
async def stats():
    return await get_stats(DB_PATH)

@app.get("/api/tasks")
async def tasks(status: str = "all", limit: int = 100):
    return [t.to_dict() for t in await list_tasks(DB_PATH, status)][:limit]

@app.post("/api/tasks/{task_id}/retry")
async def retry(task_id: str):
    task = await get_task(DB_PATH, task_id)
    if not task or task.status != "failed":
        raise HTTPException(404)
    await retry_single(DB_PATH, task_id)
    return {"ok": True}

The worker dashboard wired to this API shows live stats chips (pending / running / done / failed / total cost), a filterable task table with spec snippets, and action buttons for the write operations. It's the kind of visibility that makes the queue feel like infrastructure rather than a black box.

What this is and isn't

SQLite is the right choice for a local-first single-developer queue. It handles hundreds of tasks per day without any tuning. It does not scale horizontally — if you want multiple machines draining the same queue, you need a proper message broker. The design is explicit about this: local-first is a constraint, not an oversight.

Agent tasks are the rough edge. A twenty-turn loop doing file operations in a sandbox is genuinely useful, but the failure modes are weirder than a single LLM call. Tasks can fail partway through, leave partial file writes in the sandbox directory, and produce error messages that require reading the agent's internal turn log to understand. Better tooling for inspecting agent task failures is on the list.

Cost tracking is approximate for agent tasks because multi-turn token counting is messier than a single call. The numbers are directionally right but I wouldn't use them for billing.

None of this is novel in isolation. Task queues are old. LLM wrappers are everywhere. What's less common is an MCP server that is the queue — so the AI submits to it directly, over the same transport it uses for everything else, without any extra integration. That feels like the right shape for local-first AI tooling: instruments that fit into the existing protocol surface rather than adding new ones.

