# The autonomous desktop layer: Co-work, Skills, and Routines that work while you sleep

> A desktop AI agent now lives on your machine. Co-work touches your files, Skills encode how, Routines fire on schedule. What decides whether it works is not the model but two disciplines: context management and instruction quality. Here's the buildable architecture, honestly.
>
> https://pravda.systems/blog/desktop-autonomy-skills-routines · 2026-06-18

You open a tab, type a question, read the answer, close the tab. Every scrap of context dies with it. Tomorrow you re-explain who you are, what you're working on, the tone you want, again. That's the loop most people are still in: a very smart search bar wearing an assistant's face. Software 1.0 thinking in a world that already moved on.

The interesting frontier moved with it. The model can now *live* on your machine, watching your files, holding your standing context, running multi-step work, and handing you a finished deliverable while you sleep. You stop opening a browser tab to ask a question and start finding files in a folder that weren't there when you went to bed. The operators selling this call it the autonomous desktop layer, and underneath the breathless framing there's a real, buildable architecture with three parts: **Co-work** (the hands), **Skills** (the brain), **Routines** (the heartbeat), extended by connectors and a phone. The thesis here is narrower and more useful than the hype: the layer is real today, but the thing that decides whether it works for *you* is not the model. It's how disciplined you are about context, and how good your instructions are. Everything below serves that one claim.

A scoping note: this is the business-user, non-developer layer (the knowledge-and-workflow desktop), deliberately separate from the terminal/agentic-coding world and from pure workflow tools. Where they overlap (Skills exist in both, connectors are everywhere), the business-user expression lives here.

## What is the autonomous desktop layer, really?

**It's not a product. It's an architecture that bolts four capabilities onto a language model (local file access, persistent memory, scheduled execution, and tool integration) so the system can do work without a human watching every step.** A chatbot is a brain in a jar: you type, it answers, the conversation dies. An agent is that same brain wired to hands, a memory, and a goal you wrote down once. The desktop layer makes the agent *persistent*. You move from "I asked the AI a question" to "I told the system what outcome I needed by Friday, and it handled the steps."

Map it onto a body and it stays in your head:

- **Co-work, the hands.** The local desktop app that actually touches your machine. It can "open, edit, create, and organize files, or run scripts" inside a folder you designate, "rather than just responding in text." This is the only one of the three with write access to your disk: the line between a consultant who gives advice and an employee at the next desk who does the work.
- **Skills, the brain.** Instruction files that encode *how* to do one specific job well. A skill is a folder whose one required file the agent reads when the task fires. Creators describe it as "a playbook for repeatable tasks," "a recipe you give to Claude." Without one, output is "generic, predictable, zero personality," the internet's average voice.
- **Routines, the heartbeat.** Saved tasks that fire *when* something happens, on a schedule or a webhook. The load-bearing detail: they can run on Anthropic's cloud, so "your computer does not need to be on."

The field sorts the surface into three modes, and the distinction matters: **Chat** answers questions, **Code** is for technical building, and **Co-work is an employee that completes whole tasks on your computer.** Claude auto-routes. A coding ask goes to Code, a knowledge or workflow ask to Co-work.

The interplay is the whole point. Skills define *how*. Routines define *when*. Co-work defines *where*. One framing predicts everything that follows: an agent runs an **observe → think → act** loop. It *observes* the available context (identity files, the project folder, connector data, the system prompt), *thinks* through a plan against its goal, then *acts* by calling a tool, and the result feeds straight back into the next observation. The loop ends only when a **definition of done** is met. Skip that finish line and the agent loops forever, burning tokens and time. That's not decoration. It's the literal reason the discipline below matters.
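The loop and its finish line fit in a few lines. This is a minimal sketch, not how any particular agent is implemented: `run_agent_loop`, `LoopResult`, and the four callables are hypothetical names, and `max_steps` is an added safety budget so a missing definition of done cannot run forever.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    done: bool
    steps: int
    state: dict


def run_agent_loop(
    observe: Callable[[dict], dict],
    think: Callable[[dict], str],
    act: Callable[[str, dict], dict],
    is_done: Callable[[dict], bool],
    max_steps: int = 10,
) -> LoopResult:
    """Run observe -> think -> act until the definition of done is met."""
    state: dict = {}
    for step in range(1, max_steps + 1):
        context = observe(state)      # gather what the agent can see
        action = think(context)       # choose the next action against the goal
        state = act(action, state)    # the result feeds the next observation
        if is_done(state):            # the definition of done
            return LoopResult(True, step, state)
    return LoopResult(False, max_steps, state)  # budget spent, goal not met


# Toy usage: "done" means every receipt has been processed.
receipts = ["a.pdf", "b.pdf", "c.pdf"]
result = run_agent_loop(
    observe=lambda s: {"remaining": [r for r in receipts
                                     if r not in s.get("processed", [])]},
    think=lambda ctx: ctx["remaining"][0],
    act=lambda action, s: {"processed": s.get("processed", []) + [action]},
    is_done=lambda s: len(s.get("processed", [])) == len(receipts),
)
```

Swap `is_done` for a predicate that never returns true and the loop stops only because the budget runs out, which is the failure the paragraph above describes.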

## Why does capability-as-config beat one giant prompt?

**Write each skill as a plain-text file and you sidestep the entropy that kills every other agent setup.** Over time you add instructions, prompts get longer, the model starts hedging, output regresses toward generic. Logic baked into one mega-prompt collapses under that weight. You can't version it, test one piece without running the whole thing, or move it to a different interface without a rewrite.

A file fixes this three ways:

- **Portability.** The same skill file runs in the desktop agent, a code environment, a browser extension, or an API call. The interface changes, the skill doesn't. One `skill.md` works across Chat, Co-work, and Code.
- **Cheap routing.** A skill opens with a short metadata block, a name and a one-line description. The agent reads that first and decides whether to load the rest, so the full instructions enter the context window only when relevant. Enabling many skills then costs roughly what one question costs.
- **Composability.** Skill files are independent units. Update the writing skill without touching the research skill. Enable three for one project and a different three for the next. Hand one to a colleague without exposing anything else about your setup.
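Cheap routing is easy to picture in code. The sketch below assumes a metadata block delimited by `---` lines with `name:` and `description:` keys, and it uses naive word overlap as a stand-in for the model's own judgment. Both are assumptions for illustration, as are the function names.

```python
from pathlib import Path
from typing import Optional


def read_metadata(skill_file: Path) -> dict:
    """Read only the leading metadata block, never the full instructions."""
    meta: dict = {}
    with skill_file.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return meta                   # no metadata block present
        for line in fh:
            if line.strip() == "---":
                break                     # stop before the skill body
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
    return meta


def pick_skill(task: str, skills_dir: Path) -> Optional[Path]:
    """Return the skill whose description shares the most words with the task."""
    task_words = set(task.lower().split())
    best, best_score = None, 0
    for skill_file in sorted(skills_dir.glob("*/skill.md")):
        description = read_metadata(skill_file).get("description", "")
        score = len(task_words & set(description.lower().split()))
        if score > best_score:
            best, best_score = skill_file, score
    return best                           # None when nothing is relevant
```

Only the winning file would then be read in full, which is why the cost of enabling many skills stays close to the cost of their one-line descriptions.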

The counterfactual is a prompt library, saved inputs you paste by hand. Better than nothing, but it doesn't route itself, doesn't persist business context, and doesn't compound. This is the same content-is-config instinct a well-run system already uses for everything else.

## How do you set up the station so it doesn't drown in context?

**Co-work hands the model your file system, and the single most common mistake (repeated across every creator) is pointing it at your whole Documents folder.** Do that and you get three failures at once: privacy exposure, wasted tokens, and a model that hallucinates priorities because it can't separate signal from a decade of clutter.

The fix is a dedicated, deliberately small working directory (a "co-work station," sometimes called a playground) with three subfolders that bound the agent's known world:

| Folder | Holds | Why it exists |
|--------|-------|---------------|
| `context/` | Identity and standing-instruction files | The agent reads this first, every session |
| `projects/` | Active work and in-flight deliverables | Where the agent operates, its working desk |
| `output/` | Finished assets | Where you go to collect results |

Keeping output separate stops the agent treating a draft as a final artifact, and stops you feeding a previous output back in as a source.

The field treats two additions as non-negotiable. First, **guardrail instructions** that forbid deleting, overwriting, or renaming any file without confirming. This isn't a trust problem. It's operations discipline: an agent on a live file system can cause irreversible data loss in the same three seconds it produces a summary you love. Second, **memory written to local markdown, not cloud chat history.** Chat history is ephemeral, not searchable across sessions, not version-controlled, and unreadable by another agent. Local markdown is the opposite on every count: persistent, searchable, versionable, and readable by any agent. The `context/` folder is your agent's long-term memory.
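If you'd rather script the station than click it together, a sketch like this does the job. The file name `guardrails.md` and the exact guardrail wording are assumptions. The three folder names come from the table above.

```python
from pathlib import Path

# Illustrative wording; adapt it to your own rules.
GUARDRAILS = """# Guardrails
- Never delete, overwrite, or rename a file without asking for confirmation.
- Write finished work to output/ and never read output/ as a source.
"""


def create_station(root: Path) -> Path:
    """Create the three-folder station, leaving any existing files untouched."""
    for name in ("context", "projects", "output"):
        (root / name).mkdir(parents=True, exist_ok=True)
    guardrails = root / "context" / "guardrails.md"   # hypothetical file name
    if not guardrails.exists():                       # follow our own no-overwrite rule
        guardrails.write_text(GUARDRAILS, encoding="utf-8")
    return root
```

Calling `create_station(Path.home() / "cowork-station")` twice is safe: the second call changes nothing.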

There's a quiet payoff beyond tidiness. Reading a local directory directly bypasses the chat window's upload ceiling (commonly cited as 20 files at 30MB each) and runs against a larger context window, so the agent compacts less often. Feed it a clean station and it operates inside known walls. Feed it your desktop and it burns context inventorying noise before it starts the job.

## How do you give the agent an identity instead of generic output?

**Even inside a clean station, an agent that doesn't know who you are produces the internet's average voice, competent and sterile.** You fix that with three markdown files in `context/`, which the agent reads automatically at the start of every session so you never re-explain yourself:

1. **`about-me.md`:** your role, business, audience, what you care about, your brand. Stops the model writing for the wrong demographic.
2. **`brand-voice.md`:** tone, style examples, the words you use and, critically, the words you *avoid*. That single AVOID list does more to kill "AI slop" than any prompt trick. Generic output is, mechanically, the absence of constraints. A list of banned phrasings is a hard constraint the model can't route around.
3. **`working-preferences.md`:** how you want documents formatted and how the agent should handle ambiguity. Want prose, not bullets? Say so. Want it to *ask* rather than guess? Enforce it here. That's the rule that turns a confident wrong answer into a one-line clarifying question.
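An AVOID list is also one of the few voice rules you can check mechanically after a run. A minimal sketch, assuming you keep the banned phrases as a plain list. The phrases shown are illustrative, not taken from any real brand file.

```python
import re
from typing import List


def find_banned_phrases(draft: str, avoid: List[str]) -> List[str]:
    """Return the AVOID-list phrases that appear in the draft, case-insensitively."""
    hits = []
    for phrase in avoid:
        # Word boundaries stop "delve" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, draft, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


avoid_list = ["delve", "game-changer", "in today's fast-paced world"]
draft = "In today's fast-paced world, this tool is a game-changer."
hits = find_banned_phrases(draft, avoid_list)   # two of the three phrases match
```

A non-empty result is a reason to send the draft back, and a cheap candidate for a skill's definition of done.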

The same instinct scales one level up: **Global Instructions** (Settings → Co-work → Global Instructions) set a default tone, role, and formatting brief once, applied across every project. Per-project context files specialize from there.

A prompting shift comes with this. In a chat window you write **task-first**, step-by-step instructions, because you're guiding a conversation. With a file-acting agent you switch to **outcome-first**: describe the destination and the rails, then let the agent plan the route. Task-first language micromanages the execution path, so the agent can't adapt when a step fails. Outcome-first gives it a target and a quality gate.

- Task-first: "Read each receipt file and add up the totals."
- Outcome-first: *(reserved for members — sign in free at pravda.systems)*

The first produces a number. The second produces a usable deliverable, and it survives a receipt with a weird layout because the agent owns the route, not just the steps. The Co-work demos lean on exactly this for chores like extracting 100+ receipts (PDF/JPEG) into Excel, splitting a 400MB+ PDF into chapters, or converting image-slides into editable PPTX.

## What's inside a Skill, and how does the DBS structure work?

**A skill folder has one required file and two optional layers, forming a structure worth naming: Direction, Blueprints, Solutions.**

| Layer | Folder | Contents |
|-------|--------|----------|
| **D**irection | `skill.md` | Metadata, the step-by-step workflow, the rules, and the definition of done |
| **B**lueprints | `references/` | Static assets the workflow needs (voice guides, style samples, brand data, lookup tables); one shown example bundles a 40-page brand-guide PDF |
| **S**olutions | `scripts/` | Code (usually Python) for what natural language does badly: API calls, file conversions, calculations |

**Direction** is the skill's constitution. The metadata block (name + description) lets the agent decide whether to load the file at all, so unused skills cost almost no tokens. Below it sit three sections: a workflow written like a standard operating procedure for a new hire (numbered steps, concrete actions, and explicit decision points such as "If the source has no date, flag it rather than estimate"), the rules, and the definition of done. That finish line is the most commonly omitted element and the most common cause of disappointing output. Without it the agent stops at the easiest point that still resembles a finished result. With it, the output must match the spec or the agent keeps going.

A skill overrides generic output with a deterministic path. A "YouTube Script" skill doesn't say "write a script." It enforces a specific hook structure, dictates pacing, and requires a named call-to-action format.

**Blueprints** live in `references/` rather than inside `skill.md` because they change independently and because stuffing large static assets into the main file wastes tokens on every load. The agent reads the relevant reference when the workflow calls for it, not before.

**Solutions** are the part that separates a real skill from a saved prompt. Some creators treat a skill as "just a system prompt." The capable ones don't. That's the whole point of the `scripts/` layer. The moment a job touches data or another system, the skill *calls code* with a contract: defined inputs, outputs, and failure modes. The agent generates a script on first use, saves it, and reuses it next time. The first run builds the tooling. Every later run uses it, so the skill gets faster and more reliable over time. If your skill only rewrites text, the prompt view is fine. The moment it touches data, treat it as code.
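Here is what "a contract" can look like in a `scripts/` file. This is a hypothetical helper, not one from any shipped skill: it assumes the receipts have already been extracted to CSV text with `vendor` and `amount` columns, and it states its inputs, output, and failure mode in the docstring.

```python
import csv
import io
from decimal import Decimal, InvalidOperation


class ReceiptError(ValueError):
    """Raised when a row cannot be parsed, so the agent flags it instead of guessing."""


def total_receipts(csv_text: str) -> Decimal:
    """Contract.

    Input:  CSV text with header columns 'vendor' and 'amount'.
    Output: the exact sum of 'amount' as a Decimal.
    Fails:  ReceiptError naming the first row with a missing or non-numeric amount.
    """
    total = Decimal("0")
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):   # line 1 is the header
        raw = (row.get("amount") or "").strip()
        try:
            total += Decimal(raw)
        except InvalidOperation:
            raise ReceiptError(f"row {line_no}: bad amount {raw!r}") from None
    return total
```

`Decimal` keeps currency arithmetic exact, and the explicit error gives the workflow a decision point: stop and flag the row rather than produce a confident wrong total.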

How do you author one? The field disagrees, and the disagreement is worth resolving rather than splitting. One camp builds top-down with the DBS framework. The pre-built **Skill Creator** skill automates it, interviewing you about your process and writing the `.md` and scripts for you. The other builds bottom-up by reverse-engineering: perform the multi-step task by hand in Co-work until the output is right, then say "create a skill from this workflow." **The call: start bottom-up, then formalize top-down.** Reverse-engineering from a run that already worked guarantees the steps are real, not imagined. The Skill Creator then hardens them into the DBS layout. Add a **self-improvement loop**: after a run that got something wrong, edit the skill so the next run doesn't repeat it, and the system improves session over session instead of plateauing.

## How do you make it run on its own, and where does that break?

**A routine is a skill that runs on a schedule rather than on demand. You click Schedule, write the prompt, pick a frequency, save.** The defining property: unlike a deterministic workflow tool, a routine is *non-deterministic*. "You can't drag a node that thinks. You write a paragraph describing what you want and Claude reasons through the steps each run." That's a feature for fuzzy work and a liability for anything that must execute identically every time.

The canonical first routine is the **morning briefing**: early each day it checks Gmail for unread mail in the last 12 hours, checks Google Calendar, summarizes what's on your plate, and writes `morningbrief.md` into `output/`, in your voice, prioritized by your preferences. (Creators quote different fire times, 6, 8, 9 AM, which tells you the time is illustrative, not a spec.) You read it the way you read a newspaper, not the way you review a draft. A content variant is just as common: each morning, check specific X accounts over a 24–48h window, de-dupe against prior posts, and update a local HTML landing page. That moment you stop thinking about the underlying task at all is the signal a routine has landed.

The architectural decision that matters most is local versus remote, and they are not interchangeable:

| Aspect | Local routine | Remote (cloud) routine |
|--------|---------------|------------------------|
| Runs on | Your CPU/RAM | Anthropic's cloud (fresh Linux env per run, a cited spec is 4 vCPU / 16GB RAM / 30GB disk) |
| Machine state | Must be awake, app open | Laptop can be off |
| Cost | Draws your usage allowance | Hourly limits; research preview, so expect change |
| Connectors | Limited to local tools | Full Gmail / Slack / Drive access |
| Secrets | `.env` files | Environment-variable panel (never in the prompt) |
| File access | Full station + local apps | Fresh environment each run. Often a GitHub repo as the workspace |
| Best for | Morning briefings, context-rich drafts | Scheduled publishing, monitoring, overnight jobs |

The rule the field converges on: **prove the logic locally, then graduate to remote for set-and-forget.** A local routine meant to publish overnight skips every night the laptop is asleep or the app is closed. A remote one runs regardless, but it only sees your business context if you keep that context somewhere the routine can reach, such as a repository it can clone. Don't pay the remote setup cost until the logic is verified.
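Keeping secrets out of the prompt looks the same in both columns from a script's point of view: it reads the environment. A minimal sketch with hypothetical helper names. How the variables get populated (a loaded `.env` file locally, the environment-variable panel remotely) is outside the code.

```python
import os
from typing import Mapping


class MissingSecret(RuntimeError):
    """Raised when a required secret is absent, before any work starts."""


def get_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    value = env.get(name, "").strip()
    if not value:
        raise MissingSecret(f"{name} is not set in the environment")
    return value


def redact(text: str, secret: str) -> str:
    """Mask a secret before text is logged or echoed back into a prompt."""
    return text.replace(secret, "***") if secret else text
```

Because the script never hard-codes the value, the same file moves from a local run to a remote one without edits.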

For event-driven rather than time-driven work there's **Dispatch** (mobile delegation): text a task from your phone, the desktop does the heavy lifting against the same files and history, and you get a push notification when it's done. "Work completed when it's done," setup cited at roughly two minutes. It inverts the usual model. Instead of sitting at the desktop directing the agent, you send a short outcome prompt and collect the result later. It works for tasks with a clear output and poorly for anything that needs back-and-forth. Its precondition is a well-specified skill already in place. A vague outcome with no skill loaded produces a vague result.

Now the honest part the demos skip. **There is no pause-and-ask inside a single routine.** A routine that needs human approval mid-flow can't stop and wait. The workaround is real and worth copying: split the logic across two routines with a connector as the gate. Anyone claiming a single routine safely "handles the whole approval flow" hasn't hit this wall yet. The two-routine gate is what makes supervised autonomy real instead of aspirational.
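One way such a gate could work, sketched under heavy assumptions: the `Gate` class is an in-memory stand-in for whatever connector holds the queue (a sheet, a database, a channel), and every name and status value here is hypothetical. The shape is the point: the first routine proposes and exits, a human flips a status, and the second routine acts only on what was approved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Proposal:
    proposal_id: str
    action: str
    status: str = "pending"   # a human changes this to "approved" or "rejected"


class Gate:
    """In-memory stand-in for the connector that sits between the two routines."""

    def __init__(self) -> None:
        self.items: Dict[str, Proposal] = {}


def routine_propose(gate: Gate, proposal_id: str, action: str) -> None:
    """Routine 1: do the thinking, record a proposal, then exit. It never acts."""
    gate.items.setdefault(proposal_id, Proposal(proposal_id, action))


def routine_execute(gate: Gate) -> List[str]:
    """Routine 2: on its own schedule, act only on human-approved proposals."""
    executed = []
    for proposal in gate.items.values():
        if proposal.status == "approved":
            executed.append(proposal.action)   # the real side effect would go here
            proposal.status = "done"           # never run the same approval twice
    return executed


gate = Gate()
routine_propose(gate, "post-42", "publish draft 42")
before_approval = routine_execute(gate)        # nothing approved yet
gate.items["post-42"].status = "approved"      # the human step, done out of band
after_approval = routine_execute(gate)
```

Neither routine ever waits. The pause lives in the data between them, which is what lets each one stay a plain run-to-completion job.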

## How do you wire it to the rest of your stack?

**In isolation the layer is useful. Wired to your data it becomes far more capable, and the best operators don't bet on one platform.** MCP connectors do the wiring (Gmail, Drive, Calendar, Notion, Slack, Canva, Figma, WordPress, media tools), and once connected the agent reads *and acts*. Permissions are per-connector and granular: **Always allow / Needs approval / Block.** That's the practical knob for keeping a connected agent safe. The principle mirrors skills: grant the minimum access that makes the task possible. An agent that reads your mail doesn't need to send mail. Add send access only when a tested routine uses it correctly. Permissions accumulate silently, so audit them on a fixed cadence.
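The three permission levels plus least privilege reduce to a small lookup. This models the idea only. It is not the app's configuration format, and the `POLICY` table and `check` function are hypothetical.

```python
from enum import Enum


class Permission(Enum):
    ALWAYS_ALLOW = "always allow"
    NEEDS_APPROVAL = "needs approval"
    BLOCK = "block"


# Hypothetical policy: reading is open, sending is gated, deleting is off.
POLICY = {
    ("gmail", "read"): Permission.ALWAYS_ALLOW,
    ("gmail", "send"): Permission.NEEDS_APPROVAL,
    ("gmail", "delete"): Permission.BLOCK,
}


def check(connector: str, action: str, approved: bool = False) -> bool:
    """Least privilege: anything not listed in POLICY is blocked by default."""
    level = POLICY.get((connector, action), Permission.BLOCK)
    if level is Permission.ALWAYS_ALLOW:
        return True
    if level is Permission.NEEDS_APPROVAL:
        return approved
    return False
```

The default-to-block lookup is the part worth copying: a connector you forgot to think about gets no access, rather than full access.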

Three integration patterns recur in real builds:

- **Live Artifacts.** Unlike a static PowerPoint, a Live Artifact pulls fresh data through connectors every time it opens: a dashboard wired to live sales from a sheet or open tickets from Slack, shareable with a team, that refreshes itself instead of burning tokens on regeneration.
- **Browser use.** A Chrome extension lets the agent "read what is on the screen, click buttons, fill forms, and record + schedule multi-step flows," the escape hatch for any app without an API.
- **Office + Design surfaces.** Word/PowerPoint/Excel extensions (with cell-level citations) and a slide/infographic canvas round out the deliverable side.

And the division of labor that matters most as you scale: a reasoning layer for judgment and the human interface, and a deterministic workflow engine for execution and an audit trail. The reasoning layer (Claude) is good at fuzzy work (planning, drafting, deciding). An engine like [n8n](https://n8n.io) is good at the opposite: steps that must run identically every time, with retries and a logged history. [Make.com](https://make.com) covers simpler integrations. Custom Python handles heavy data crunching. Two reasons push heavy logic into the engine. First, cost and reliability. Deterministic work there saves tokens and dodges rate limits. Second, and more important: **auditability.** A skill running on one workstation has no execution history, no audit log, no replay. When a client or a regulator asks "how was this produced?", the answer "my desktop agent ran a skill" is not an audit trail. "Skills living on an individual computer is a compliance problem." Execution history lives in a tool built for it. The reasoning layer is the conductor, not the whole orchestra.
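To see what "retries and a logged history" buys you, here is the idea in miniature. This is a plain-Python illustration of the concept, not how n8n or any engine is implemented. `run_step` and the log fields are hypothetical.

```python
import json
import time
from typing import Callable, List


def run_step(name: str, fn: Callable[[], object], log: List[str],
             retries: int = 2, delay: float = 0.0) -> object:
    """Run one deterministic step, retrying on failure and logging every attempt."""
    for attempt in range(1, retries + 2):       # first try plus `retries` retries
        entry = {"step": name, "attempt": attempt, "ts": time.time()}
        try:
            result = fn()
        except Exception as exc:
            entry.update(ok=False, error=str(exc))
            log.append(json.dumps(entry))       # failures are part of the record
            if attempt == retries + 1:
                raise                           # out of retries: surface the error
            time.sleep(delay)
        else:
            entry.update(ok=True)
            log.append(json.dumps(entry))
            return result
```

Each line in `log` answers "how was this produced?" for one attempt. A model reasoning its way through the same steps leaves no such record unless something outside it writes one.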

## What's the non-obvious bottleneck, and how do you beat it?

**Across every serious build the limiting factor is not model capability. It's context management.** The most-cited concrete claim: a query that scores ~100% accuracy at 10k tokens can degrade toward ~40% near 199k tokens as the window fills. A long, stuffed context doesn't just slow the model down. It makes the model *worse*. (Reported, not independently verified, but it matches the lived failure mode: the agent slows, drifts, then halts.) The implication reframes the whole layer: **autonomy is not one giant conversation. It's many focused conversations with clean handoffs.**

Four techniques fight context rot, in rough order of impact:

1. **The iceberg technique.** Keep only the surface in active context, high-level instructions and metadata. Pull detail from under the water on demand with `read`, `grep`, and `glob` instead of dumping whole files in. Maximizes useful information per token.
2. **Sub-agents.** Delegate heavy work to a sub-agent that gets a fresh window and reports back a summary, so the main thread never fills with a subtask's minutiae. Practically: multiple sessions in the same project directory, each with a role.
3. **Pin data while building.** When debugging a workflow, pin a step's output so downstream steps reuse that data without re-running the trigger, and, more importantly, without re-paying for expensive model calls.
4. **Compaction and clearing.** `/compact` compresses history (the field quotes it squeezing roughly 80% of the context down to 30–40% of its size), buying room at the cost of some detail. `/clear` starts fresh when you switch tasks. Keep the essentials in a standing file (a `claude.md`) so compaction never throws them away.
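The iceberg technique is the easiest of the four to show in code. A rough sketch of the `grep`-style half, with a hypothetical function name and a hit cap added as an assumption: it returns only the matching lines and their locations, never whole files.

```python
from pathlib import Path
from typing import List, Tuple


def grep_files(root: Path, needle: str, pattern: str = "**/*.md",
               max_hits: int = 20) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for matching lines only."""
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                if needle.lower() in line.lower():
                    rel = path.relative_to(root).as_posix()
                    hits.append((rel, line_no, line.strip()))
                    if len(hits) >= max_hits:   # cap what enters the context
                        return hits
    return hits
```

Twenty short lines cost a sliver of what the underlying files would, and the agent can still open any one of them in full if a hit looks relevant.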

If you internalize one section from this entire article, make it this one. The operators who succeed aren't the ones with the cleverest prompts. They're the ones who treat the context window as the scarce resource it is.

## How does the memory layer compound, and without rotting?

**The most durable architecture for a system that improves rather than plateaus is a simple pipeline: ambient capture, a local vault, an agent that reads it.** The three pieces are independent and you assemble them gradually.

Ambient capture means business context ends up in machine-readable files rather than your head: notes from calls, decisions made, constraints discovered, key passages from research. A screen-and-mic capture tool can do this continuously, exporting transcripts into the vault overnight. The tool matters less than the habit. The vault is a directory of markdown, organized, version-controlled, searchable, dependent on no single platform's survival. The agent reads the relevant files on every prompt that needs business context. This isn't complex retrieval engineering. The agent opens the files and applies what it finds.

The compounding effect is the payoff. Add a constraint to the vault and every future task that reads it benefits. Refine a voice guideline and every future output under it improves. The system accumulates context the way a good employee accumulates institutional knowledge, session over session.

This is the part that fails if you neglect it, so the discipline is non-negotiable. **On-demand recall, not auto-load-everything:** the agent pulls the slice it needs rather than dragging the whole vault into context on every call. Auto-loading everything is the context-rot bottleneck wearing a different hat. **Periodic refactor:** on a fixed cadence, have the agent condense the memory file, keeping names, projects, decisions, and lessons while dropping small talk. **Keep it local and encrypted.** The vault holds your most sensitive business intelligence. Keep it off cloud backups, encrypt the disk, and if you version it with git, encrypt the contents. A leaked memory vault is worse than a leaked password, because it's the context, not just the key.

## What's hype, and what's substance?

**The marketing outruns the mechanics in predictable places.** Hold the claims against what the same builds actually show:

| Claim you'll hear | What the builds actually show |
|---------------------|-------------------------------|
| "AI agents replace people / developers" | Agents augment. Production work still needs human oversight |
| "Skills work perfectly out of the box" | Skills need real iteration before they're reliable |
| "Routines run forever, no issues" | Hourly limits, no in-routine pause, error handling required |
| "Co-work can access your whole computer" | Sandboxed to a folder you designate, not blanket access |
| "Automate 99% of your content / save 90% of your time" | Single-source, self-reported. The *pattern* is sound, the *percentage* is marketing |

Treat any income, follower, time-saved, or "ten routines hand back a whole workday" figure as motivation, not a benchmark. These are creator-reported and unverified, and many tutorials that quote them funnel toward a paid product. Note the conflict of interest when a "best stack" happens to be the creator's own tool.

Take the self-healing CI/CD demo. The claim: a build fails, the agent reads the error log, fixes its own code, prevents the recurrence, no human. The shown mechanism is real and genuinely runs: GitHub webhook → log analysis → diagnosis → fix → Base64-encode → commit via the GitHub API → open a PR → email notification. The *framing* ("no human") is hype, because a human still reviews the PR before it merges, which is exactly correct, and exactly what the two-routine approval gate exists to enforce. Hold both thoughts: the plumbing is substance. The autonomy is supervised.
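The Base64 step exists because GitHub's "create or update file contents" endpoint expects the file body Base64-encoded. The sketch below only builds the request and sends nothing. The owner, repo, branch, and function name are placeholders, and the demo's actual code is not shown in the source.

```python
import base64
from typing import Optional


def build_commit_request(owner: str, repo: str, path: str, new_text: str,
                         message: str, branch: str,
                         sha: Optional[str] = None) -> dict:
    """Build a PUT request for GitHub's repository contents endpoint."""
    body = {
        "message": message,
        # The API expects the file content Base64-encoded.
        "content": base64.b64encode(new_text.encode("utf-8")).decode("ascii"),
        "branch": branch,          # a fix branch, so a PR and a human review follow
    }
    if sha:
        body["sha"] = sha          # blob SHA, required when updating an existing file
    return {
        "method": "PUT",
        "url": f"https://api.github.com/repos/{owner}/{repo}/contents/{path}",
        "json": body,
    }
```

Committing to a branch rather than the default branch is what keeps the human in the loop: the fix arrives as a pull request, not as merged code.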

## What's the build plan, and how do you verify each phase?

**Explicit, in order, with a verification gate at each step. The gate is the point: if it fails, you have one broken layer to fix, not a vague sense that "it's not working."** Phases 1–3 use nothing beyond the desktop app. Phase 4 reaches for the rest of the stack.

**Phase 1: Foundation.** Install the desktop app. Without it there's no Co-work, Skills, or Routines. Create the station with `context/`, `projects/`, `output/`, plus a guardrail instruction (never delete, overwrite, or rename without confirming). Set Global Instructions with your default tone, role, and formatting. *Verify:* ask "what do you know about me?" It should answer from your files, not generically.

**Phase 2: Identity and first skill.** Write the three `context/` files (about-me, brand-voice with its AVOID list, working-preferences with ask-don't-guess). Then build one skill from a weekly task, bottom-up: perform it by hand until the output is right, say "create a skill from this workflow," and let the Skill Creator harden it into the DBS layout. *Verify:* run the skill in a fresh session on new input. The output should match your manual run, not regress to generic.

**Phase 3: First routine, local then remote.** Pick a recurring task. The morning briefing is the obvious starter. Build it locally, watch 3–5 runs, confirm output by hand. Only then convert to a remote routine for set-and-forget, moving secrets into the environment-variable panel and making your context reachable. *Verify:* the deliverable appears after the scheduled time with zero manual intervention.
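The Phase 3 gate can be checked by a script instead of by eye. A minimal sketch, assuming the routine writes its deliverable to a known path such as `output/morningbrief.md`. The function name is hypothetical, and the comparison uses local, naive timestamps.

```python
from datetime import datetime
from pathlib import Path


def deliverable_is_fresh(path: Path, scheduled: datetime) -> bool:
    """True if the file exists, is non-empty, and was written at or after `scheduled`."""
    if not path.is_file():
        return False                        # the routine never produced it
    stat = path.stat()
    if stat.st_size == 0:
        return False                        # an empty file is not a deliverable
    return datetime.fromtimestamp(stat.st_mtime) >= scheduled
```

A `False` here after the scheduled time points at one layer, the routine, rather than at the skill or the context files you already verified in earlier phases.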

**Phase 4: Integration.** Wire MCP connectors and confirm with a read-only request ("summarize my last 10 emails") before granting any write access. Stand up a deterministic engine like n8n for any process that needs an audit trail or must run identically. Add sub-agents for heavy research. Implement the two-routine approval gate for anything high-stakes. *Verify:* an external trigger fires and completes end-to-end, and a deliberately injected failure either recovers or notifies you.

The criterion for "is it working?" is behavioral, not technical. You'll know the layer has landed when you stop opening a browser tab to ask a question and instead find yourself opening files in `output/` produced while you were away. Success is not the quality of one reply. It's the accumulation of autonomous actions executed in your voice, to your standard, without your presence.

When you're unsure which tool a job wants:

| Scenario | Use | Don't use |
|----------|-----|-----------|
| One-time complex analysis | A manual Co-work session | A routine (overkill) |
| Daily or weekly recurring report | A routine | Manual execution |
| Repeatable task with fixed steps | A skill | A fresh prompt each time |
| Multi-step process that must run identically | A deterministic engine + the reasoning layer | One monolithic non-deterministic routine |
| Needs human approval mid-flow | Two routines + a connector gate | A single autonomous routine |
| Heavy data crunching | The engine + Python | The model directly (token cost) |
| Audit trail required | The engine's execution history | Skills scattered on a laptop |

## The honest bottom line

The autonomous desktop layer is real and usable today, and it is not magic. Stripped of the demo-reel gloss, the builds say success needs five unglamorous things: instruction quality worth iterating on, disciplined context management, appropriately scoped tasks (start low-stakes and high-frequency), a human-oversight architecture for anything that matters, and multi-tool integration instead of one-platform faith. The "no coding required" crowd is half right. You can build genuinely capable systems without writing a traditional program. But you still have to *think* like a systems architect: inputs, outputs, failure modes, the definition of done, and how you'll verify it. The code is optional now. The systems-architect thinking never was.

One last caution. The dated parts (the exact upload limits, the current cloud specs, this season's release notes, the interface walkthroughs) will rot first. The architecture under them won't. Hands, brain, heartbeat. A clean station. An identity that kills generic output. Outcome-first instructions. The two-routine gate. Context treated as the scarce resource. A memory spine you maintain. Build for that, and a model swap or an interface change costs you a config line, not a rebuild.

Start small. Prove one skill. Schedule one routine. Verify it ran. Then scale.
