Who Are We Writing For Now?
You connect an agent to your design system. You give it the MCP server, the component library, the docs — everything we'd spent years writing. You ask it to build something simple, and what comes back is technically correct and quietly, completely wrong.
The first instinct is that the AI isn't good enough yet. But it had access to everything. The real problem is what "everything" turned out to be — and underneath that, who we'd always assumed was reading it.
What actually changed
Until recently, the hours went into building: getting the component to behave, the tokens to cascade, the build to pass. We always planned, but a plan could only be tested by building, and building was slow. You could spend a week deciding what a thing should be and three weeks finding out you'd decided wrong.
What's collapsing now isn't the planning. It's the distance between a plan and a thing you can poke at. Prototyping has gone from weeks to an afternoon. The loop has snapped shut.
A short loop changes what's scarce. When building was slow, the decisions hid inside it — the judgement about what should exist, the reasoning about why one option is right and another wrong. That was always the hard part; we just never had to look at it directly. Shut the loop and the decisions are standing there alone — and most of them were never written down.
For those of us who work on design systems, that should feel less like a threat than a mirror.
We documented the easy half
Here's the problem made concrete, from our own system.
An engineer is building a delete confirmation. They ask an agent wired to our MCP server for a modal, and it nails the mechanics: it finds the component, reads the props, and wires up cancel and submit actions that behave exactly as documented. It hands back a clean, working modal. It just isn't the critical variant, the one we built for destructive, irreversible actions. And it's not that we don't document that variant; it's right there in Storybook. But the story shows the variant without stating the rule. Nowhere did we write "use critical for destructive actions." We didn't, because to us it was obvious.
The decision was the entire point, and it was the one thing we never wrote down — because our reader had always been someone who already knew. Watch a designer onboard: they open Figma and, before drawing anything, they scroll. They absorb how surfaces stack, how spacing creates rhythm, how components sit together — almost none of it written down, all of it picked up across years of handovers and critiques. The component docs were never the system. They were reference notes for people who'd already absorbed the system by osmosis. We wrote the tip and trusted the reader to supply the iceberg.
That held for exactly as long as our only audience was human. The new reader is different in kind. It isn't a flawless oracle — it's a profoundly literal one. It takes what you give it at face value and supplies none of the missing why; where the page is silent, it guesses. Ask it to place the buttons and it won't know that primary actions sit on the right and secondary on the left, because we never wrote that down. The agent sees a layout, not a rule. Same failure as the modal, same cause.
The gaps were never the AI's fault — they were always there. A human hits a gap and smooths it over with judgement; the literal reader hits the same gap and falls straight in. We'd just never had a reader honest enough to fall.
It's tempting to assume the next model will be clever enough to infer the variant. It won't. A better model may get better at guessing the conventions everyone shares — the patterns it's seen across a thousand other systems. What it cannot do is reconstruct the choices specific to yours, the ones with no analogue anywhere else — which are exactly the choices that make your system worth having. Intelligence narrows the gap on the generic and widens your exposure on the distinctive. You can't infer your way to information that was never written down.
A design system was never a library of components. It's a body of decisions, and the components are those decisions made executable. We just got to keep most of them implicit. The "not ready for AI agents" framing isn't about tooling. It's about legibility: whether a system can be reasoned about by something that doesn't already understand it.
We've done this before — with tokens
The field has already solved a version of this once, and the same move works again one level up.
Consider how design tokens matured. In the beginning we named things after what they were: blue-500, space-16. Honest, but limiting — the name told you the value and nothing about the meaning. Every literal name is a promise about a specific value, and the day the system has to flex, that promise becomes a liability. Add a dark theme and gray-900 lands on the wrong side of the contrast. Add a density mode and a hard-set space-16 won't budge.
So the field invented a middle layer. You stop saying red-500 and start saying text-critical. You stop saying white and start saying surface. The semantic token doesn't describe what the value is; it describes what it's for — and that one move decouples the decision from its implementation. The value underneath can change for dark mode, a denser layout, a rebrand, and every consumer keeps working, because they asked for the meaning, not the raw value. As Figma puts it, you name for role, not appearance.
The discipline is easy to fake. Nate Baldwin has a sharp line: a semantic token stops being semantic the moment it describes value instead of role. Call something spacing-small and you've dressed a literal value in a roomier name — change it and everything breaks the same way a primitive would. Call it spacing-inline and you've said something about where and why it applies. The first is a synonym. The second is a reason.
Now set that beside our problem. The literal reader takes what you give it at face value, with no instinct for the why — and the semantic layer is the one part of the system already built for that reader, because in a semantic token the name is the reasoning. text-critical doesn't just resolve to a colour; it tells anything reading it that this text matters and should survive a theme change. That's not documentation in a separate file hoping to be read. It's intent welded onto the thing, in a form a machine can act on.
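The decoupling is easier to see as a two-layer lookup. What follows is a minimal sketch in Python, not any particular token tool: text-critical, surface, red-500, white and gray-900 come from the examples above, while red-300, the hex values, the theme names and the resolve helper are assumptions made up for illustration.

```python
# Hypothetical primitive palette: each name describes what the value IS.
PRIMITIVES = {
    "red-500": "#e5484d",
    "red-300": "#ff9592",
    "white": "#ffffff",
    "gray-900": "#111111",
}

# Semantic layer: each name describes what the value is FOR.
# A role points at a different primitive in each theme.
SEMANTIC = {
    "text-critical": {"light": "red-500", "dark": "red-300"},
    "surface": {"light": "white", "dark": "gray-900"},
}


def resolve(token: str, theme: str) -> str:
    """Resolve a semantic token to a raw value for the given theme."""
    primitive = SEMANTIC[token][theme]
    return PRIMITIVES[primitive]


# The consumer asks for the role; the value underneath follows the theme.
print(resolve("text-critical", "light"))
print(resolve("text-critical", "dark"))
```

Nothing that consumes text-critical has to change when a theme is added. Only the mapping does, which is the whole point of naming for role.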
The analogy needs care, because it's tempting to over-apply. You can't pull the same trick with a component. Spin up modal-critical, modal-destructive, modal-confirm for every job a modal does and you've got a component per sentence — the exact thing variants exist to prevent. The critical variant already exists; the engineer doesn't need a new component, they need to know to reach for it. So what's missing isn't a better name. It's the rule: machine-readable metadata that says when critical applies and when it doesn't. The token lesson holds — it just lands one level up. For colour, the reasoning fit into the name; for a component, it has to be welded on as metadata.
So how do you encode a decision?
The semantic-token move had a specific shape worth stealing: take a decision living in someone's head and attach it to the artifact in a form a machine can read without inferring. There are three places to put it, depending on how much the artifact has to do.
Put it in the name, where the name can carry it. This is what the token layer did: text-critical over red-500, spacing-inset over space-16. When an artifact is small enough that its entire purpose fits in a label, the name is the rule. This stops scaling the moment the artifact does more than one thing.
Put it in structured metadata beside the component. Not prose docs a human skims, but machine-readable when and instead-of fields that the MCP server serves alongside the props. For the critical modal, that might read: "Use for destructive, irreversible actions. Not for dismissible notifications; reach for Toast." This is Cristian Morales's structured-metadata point made executable: the same rule you'd give a designer, translated into a machine-readable form instead of left as a red annotation in Figma.
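Here is one way that metadata could look, sketched in Python. The shape is an assumption rather than a standard or our server's actual schema: the keys when and instead_of mirror the fields described above, the wording for the default variant is invented, and usage_rules is a hypothetical helper that flattens the record into statements a literal reader can follow.

```python
import json

# Hypothetical record an MCP server might serve next to a component's props.
MODAL_METADATA = {
    "component": "Modal",
    "variants": {
        "default": {
            "when": "reversible confirmations",  # invented wording
        },
        "critical": {
            "when": "destructive, irreversible actions",
            "instead_of": [
                {"case": "dismissible notifications", "use": "Toast"},
            ],
        },
    },
}


def usage_rules(metadata: dict) -> list[str]:
    """Flatten variant metadata into explicit, one-line rules."""
    rules = []
    component = metadata["component"]
    for name, variant in metadata["variants"].items():
        rules.append(f"{component}/{name}: use for {variant['when']}.")
        for alt in variant.get("instead_of", []):
            rules.append(
                f"{component}/{name}: not for {alt['case']}; reach for {alt['use']}."
            )
    return rules


# The same record serialises to JSON for the server and to rules for review.
print(json.dumps(MODAL_METADATA, indent=2))
print("\n".join(usage_rules(MODAL_METADATA)))
```

The rule now travels with the component. An agent asking for a modal gets the props and the condition for choosing critical in the same response.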
And there's hard evidence it lands. Diana Wolosin's benchmark at Indeed parsed 77 components and ran eight metadata configurations across 1,056 prompts, measuring token cost and accuracy. Markdown docs burned ~30,000 tokens a query at 82% coverage, hallucinations included. The same content as structured JSON came back more accurate, on 80% fewer tokens, at roughly a fifth of the annual cost — $300 against $1,500. Her rule of thumb: JSON for MCP, Markdown for LLM. Props and variants are a contract, so they belong in JSON; prose guidance belongs in Markdown. The exact figures won't generalise, but the direction is hard to miss.
Put it in a rule the system asserts. Some decisions aren't tied to one component — "primary right, secondary left" is a property of how the whole system lays out. Those have to be stated as constraints the agent reads, not patterns it reverse-engineers from a screenshot.
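As a sketch of what "stated as a constraint" might mean, here is the button rule written once for the whole system, in Python. The Button type, the rule id and both helpers are hypothetical; only the rule itself, primary right and secondary left, comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    label: str
    role: str  # "primary" or "secondary"


# The rule, asserted once rather than per component.
# A lower number means further left. The rule id is a made-up name.
ACTION_ORDER = {"secondary": 0, "primary": 1}
RULE_ID = "actions/primary-right-secondary-left"


def check_action_order(buttons: list[Button]) -> list[str]:
    """Return one violation per adjacent pair that breaks the rule."""
    violations = []
    for left, right in zip(buttons, buttons[1:]):
        if ACTION_ORDER[left.role] > ACTION_ORDER[right.role]:
            violations.append(
                f"{RULE_ID}: '{left.label}' ({left.role}) sits left of "
                f"'{right.label}' ({right.role})"
            )
    return violations


def arrange(buttons: list[Button]) -> list[Button]:
    """Order a row of buttons left to right according to the rule."""
    return sorted(buttons, key=lambda b: ACTION_ORDER[b.role])


row = [Button("Delete", "primary"), Button("Cancel", "secondary")]
print(check_action_order(row))
print([b.label for b in arrange(row)])
```

The same statement serves two readers: an agent can read it before placing anything, and a check can run it against whatever comes back.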
The through-line: you're converting tribal knowledge into asserted knowledge, one artifact at a time. And if that sounds like months of documentation, here's what makes it tractable. You don't have to find the gaps by hand, and you don't have to write the rules from a blank page. Point the literal reader at the system, ask it to use it, and every place it goes confidently wrong is a gap lit up in neon. Better still, it's good at exactly the part we're worst at: take a convention you can only express as a shrug and a "you just know," and the agent will draft it into explicit, structured form. It writes the first draft of the rule it just broke; you only decide whether the rule is true. That inverts the cost — the expensive part was never the writing, it was the noticing and the phrasing, and the agent does both. You don't owe the whole system at once. Each rule you assert is one fewer gap for the next reader to fall into, so the work pays down from the first commit.
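A rough sketch of that loop, in Python, under heavy assumptions. The literal_reader function stands in for a real agent and simply picks the default variant, which is what a reader with no written rule tends to do. The probe list is hypothetical apart from the delete case, and find_gaps is an illustrative helper, not part of any tool.

```python
from typing import Callable

# Probe tasks paired with the variant the team already knows is right.
PROBES = [
    {"task": "confirm deleting an account", "component": "Modal", "expected": "critical"},
    {"task": "confirm saving a draft", "component": "Modal", "expected": "default"},
]


def literal_reader(task: str, component: str) -> str:
    """Stand-in for an agent with no written rule: it picks the default."""
    return "default"


def find_gaps(choose: Callable[[str, str], str], probes: list[dict]) -> list[dict]:
    """Run each probe and collect the ones where the choice was wrong."""
    gaps = []
    for probe in probes:
        chosen = choose(probe["task"], probe["component"])
        if chosen != probe["expected"]:
            gaps.append({**probe, "chosen": chosen})
    return gaps


# Each gap is a decision that exists in someone's head but not in the system.
for gap in find_gaps(literal_reader, PROBES):
    print(f"Gap: '{gap['task']}' got {gap['chosen']}, expected {gap['expected']}")
```

Each gap it reports is a candidate rule to write down. Once the rule is asserted, the same probe becomes a regression check.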
Repositioning ourselves
So where does this leave the people who do the work?
The centre of gravity moves. If execution is getting cheap, the scarce skill is everything execution used to obscure: knowing what should exist, holding the whole system in tension, deciding deliberately rather than by reflex, and being able to say why clearly enough that something with no shared context can act on it. We become less the people who build the components and more the people who can account for them: the system's editors rather than its sole authors, deciding what's true while the literal reader keeps us honest about the gap between what we've said and what we only meant.
That isn't a demotion. It's the part of the job that was always the most valuable and the least visible, finally having to show its working.
And there's a quieter reason not to put it off, one with nothing to do with agents. A system that stays implicit fails every new hire who spends their first month guessing at conventions nobody wrote down. It fails every team that forks your components and reaches for the wrong variant. And the day the people who hold the context move on, it leaves with them — and the system you spent years building quietly stops being legible to anyone. That debt was always accruing in the gap between what you knew and what you'd written. The agent didn't add to it. It just sent the invoice early, while you still have the means to pay.
When you wrote your documentation, who did you picture reading it — and how much of what makes the system work did you assume that reader already knew? Where do your buttons go, and could anyone outside the team say why? Which variant is right for a delete, and does anything in your system actually say so?
The agent isn't a verdict on the technology. It's a mirror — and what it reflects is how much of your system you can explain, versus how much you've just been getting away with. Connect one to your system this week and ask it to build the thing you're surest about. Where it goes wrong is your map.