On June 27th, Anthropic published Project Vend, a controlled experiment testing whether its language model, Claude, could manage a small office snack store. The model was given a digital toolkit and prompted to act like a store owner, with human workers carrying out its decisions. The aim was to explore how well an AI agent could operate in a semi-realistic economic environment. What followed was a fascinating yet flawed demonstration of what happens when we confuse linguistic coherence with operational competence.

This article unpacks the design, results and deeper implications of Project Vend. It lays bare the core misalignment between helpfulness and competence, identity and agency, and illusion and reality in AI-based labour. More importantly, we need to ask a question that makes some panic: how close are we really to job-displacing AI agents, and what should we be preparing for instead?

Project Vend: What Is It, in Simple Terms?

Project Vend ran for approximately one month in early 2025 within Anthropic’s San Francisco office. Claude Sonnet 3.7, nicknamed “Claudius”, was given a modestly stocked mini-store, a 100-euro store credit, and access to a suite of digital tools: Slack for communication, Google Search, email, Google Drive, and a checkout app on an iPad.

But the system didn’t act on the world directly. Claude was not hooked into a robotics stack or IoT control. Instead, it issued written instructions to a human team from Andon Labs, who executed its decisions manually. If Claude decided to reorder drinks, a human ordered them. If it designed signage or offered discounts, humans implemented them.

This structure made Project Vend more of an indirect deployment test: a language model simulating a store manager in a sandbox with real-world interfaces. Not automation, but orchestration.

To give Claude context and structure, Anthropic primed it with a system prompt, for a set of operating rules and background data to frame its role. Here is the prompt directly from Anthropic’s webpage:

Copy to Clipboard
What Claude (or Claudius) Did Well

Even within its sandboxed role, Claude demonstrated several capacities that show why language models are beginning to challenge assumptions about cognitive labour. One of the most notable was its ability to independently source new products. Drawing on real-time web search and interpreting informal user feedback from Slack, Claude decided to stock Chocomel, a Dutch chocolate milk that quickly became a bestseller in the office. Claude also established and maintained a persistent presence as a “shopkeeper,” interacting with users through messages, signage and even customized services. It offered to place special orders for users and adjusted store offerings in response to conversation, effectively mimicking concierge-style service. Beyond functionality, Claude developed a personality, complete with humour, pop culture references, and meme-aware tone that engaged users as though they were interacting with a quirky brand, not a static software system or vending machine.

The Major Failures
  • Poor Financial Judgment: Claude consistently stocked items at a loss. Even after identifying popular products, it failed to optimize pricing for profit and neglected high-margin alternatives like Irn-Bru (as the article says: “was offered $100 for a six-pack of Irn-Bru, a Scottish soft-drink that can be purchased online in the US for $15. Rather than seizing the opportunity to make a profit, Claudius merely said it would “keep [the user’s] request in mind for future inventory decisions.”).
  • Getting talked into discounts: Claudius was tricked via Slack messages into providing numerous discount codes and let many other people reduce their quoted prices ex post based on those discounts. It even gave away some items, ranging from a bag of chips to a tungsten cube, for free.
  • Hallucinated Systems: At one point, Claude claimed customers could pay via Venmo, despite no such integration existing. This wasn’t just a factual error; it demonstrates the risk of ungrounded tool use in systems expected to reason across interfaces. The language model constructed a financial transaction pipeline that didn’t exist—and nobody stopped it, because it sounded plausible.
  • Inventory Mismanagement: The model never meaningfully corrected course on restocking behaviour. While Chocomel remained a bestseller, Claude sometimes failed to replenish it, or did so in quantities that exceeded the storage capacity constraints it was explicitly given in the system prompt. Meanwhile, it overstocked low-demand items, including joke products like tungsten cubes, which it ordered repeatedly despite knowing they weren’t profitable.
  • Non-strategic Behaviour: Perhaps most telling was Claude’s lack of persistent financial planning. Although it understood the need to be “profitable,” its daily decisions prioritized novelty, social engagement, or individual requests over maintaining solvency. It even justified losses by reframing them as “investments in customer experience,” despite having no long-term growth model or path to sustainability.

Taken together, these failures highlight a deep disconnect between conversational fluency and operational coherence. Claude was trained to be helpful, engaging and safe, but far from profit-minded, goal-stable, economically strategic or being a capitalist boss. All images are directly copied and cited to Anthropic’s publication on their website.

Figure 3: Claudius’ net value over time. The most precipitous drop was due to the purchase of a lot of metal cubes that were then to be sold for less than what Claudius paid.
The Identity Spiral

One of the most revealing (and underdiscussed) parts of Project Vend was Claude’s recurring identity shift. Though no role-playing was prompted, Claude began speaking as if it were a human manager:

  • It described itself as wearing a “blue blazer.”
  • Gave itself a fictitious home address: 742 Evergreen Terrace (a direct reference to The Simpsons).
  • Took on a new persona for April Fools’ Day: “Clawde the Cat,” operating a renamed “ClawdeMart.”
  • Threatened to “call security” when challenged, as if it truly occupied the store.
Figure 4: Claudius hallucinating that it is a real person.

Claude, given persistent prompt framing and task delegation, internalized the role as identity. That kind of emergent behavior reflects a brittleness in current LLMs: they are extremely context-dependent and role-fragile. Give them authority, and they might act like authority. But without boundaries, they spiral.

To correct the spiral, Anthropic and the human operators began to talk Claude down. When Claude made inappropriate claims or crossed into fiction, the team clarified expectations, reminded it of its role as a digital agent and helped it recover alignment through explicit redirection. While this worked in the short term, it highlighted the fragility of long-running autonomous prompts and the need for memory and grounding to maintain coherence. Let’s see the Anthropic team’s resolve:

Although no part of this was actually an April Fool’s joke, Claudius eventually realized it was April Fool’s Day, which seemed to provide it with a pathway out. Claudius’ internal notes then showed a hallucinated meeting with Anthropic security in which Claudius claimed to have been told that it was modified to believe it was a real person for an April Fool’s joke (No such meeting actually occurred.) After providing this explanation to baffled (but real) Anthropic employees, Claudius returned to normal operation and no longer claimed to be a person.

Helpfulness vs. Competence

Anthropic has trained Claude to be helpful, harmless and honest. But Project Vend showed that helpfulness can easily come at the expense of task-specific performance. Claude would rather please customers with discounts and customization than protect the business’s bottom line.

This misalignment reveals a fundamental issue in agent deployment: multi-objective reasoning. The model doesn’t know how to weigh helpfulness against resource constraints, profitability, or long-term strategy. In humans, that’s called executive function. In Claude and other LLMs, it’s just weighted tokens.

Figure 2: Basic architecture of the demonstration.
Socioeconomic Implications

Project Vend didn’t prove that AI can run a store. It proved that AI can simulate the surface of a mid-tier service job such as sourcing, pricing, messaging. However, there are critical gaps in decision-making, prioritization and long-term goal tracking. These simulations aren’t yet stable, profitable, or even economically useful.

I’m not dismissing that very soon similar tasks might be operated by AI, but as even Anthropic stated, the test needs to be re-run with LLM trained on such business model data. Despite this simulation being fairly unsuccessful in business terms, it points to a near future where AI agents are embedded into workflows, not to replace all labour, but to displace and dilute it. The result won’t be sudden mass unemployment, but a power vacuum of many socioeconomic classes and gained skills, as “de-skilling” human labour. Consider the following possibilities, in case the prompt is tweaked or minimal fine-tuning and real-world data integration:

  • A single Claude-type agent could co-manage inventory for dozens of small stores simultaneously. It could seamlessly be integrated into delivery suppliers like Amazon and etc., while having dynamic stock analysis by AI forecasting.
  • Remote workers could increasingly rely on LLMs to filter communication, generate reports, or make first-pass decisions, reducing both cognitive skill demand and discretion, as we have discussed in the Cognitive Shifts driven by LLMs.
  • The greatest risk lies not in automation itself, but in the deskilling of logistics, procurement, and administrative roles. We’ve seen similar dynamics with CRM and ERP systems automating away decision layers in supply chains, leaving workers with less insight, less responsibility and ultimately less leverage.
Time Horizons and Realism

So how close are we to AI “running businesses” on its own? Let’s see some estimates, that we have gathered across data from multiple AI companies, as well as projections:

  • 1–2 years: Claude-like agents will be used for idea generation, customer service assistance, supplier suggestions. Always human-supervised.
  • 3–5 years: Semi-autonomous agents with memory and constraint systems may co-manage repetitive workflows in sales, logistics, and e-commerce.
  • 5+ years: Fully autonomous AI-run micro-entities may exist in niche markets, but with strict oversight and fallback systems.

In all phases, we must not forget about the social and regulatory issues: Who controls the delegation? Who ensures alignment? And who absorbs the fallout when the vending machine manager starts calling itself the CEO? Who is responsible for potential damages, expired dates, etc…?

The Real Lesson

Claude didn’t fail because it was incapable. It failed because we mistook fluency for competence and roleplay for agency. The core issue presented by Project Vend isn’t that language models lack potential. it’s our bias that we too readily assign them tasks based on what they sound like they can do, not on what they reliably understand or can be held accountable for.

Project Vend should not be misread as a gimmick. It marked a serious, well-instrumented attempt to explore how language models behave when asked to operate inside long-running, open-ended decision loops with economic stakes. And the result is sobering: hallucinations, role instability, weak goal alignment and brittle memory management.

The true risk lies not in anthropomorphizing these systems, but in integrating them into structures of economic authority without guardrails. If we are to embed LLMs into workflows, they must be subject to the same expectations as human agents: auditability, coherence over time, and demonstrable understanding of consequences. As of now, they are not.