Uber burned through its entire 2026 AI budget by April. Microsoft revoked Claude Code licenses after individual engineers hit $2,000 a month in token costs. One company reportedly ran up a $500 million Claude bill in a single month after forgetting to set usage limits.
Now a startup with 13 employees thinks it has found the fix.
Engram emerged from stealth on June 23 with $98 million in funding at a $600 million valuation, backed by General Catalyst, Kleiner Perkins, Sequoia Capital, and OpenAI co-founder Andrej Karpathy. Its pitch: a "learned memory layer" for enterprise AI that matches or outperforms frontier models while using up to 100 times fewer tokens. Microsoft, Notion, and Harvey are already testing it in production.
If Engram delivers, it could fundamentally change the economics of enterprise AI deployment. If it doesn't, it's a $600 million bet on a problem that model providers themselves might solve with cheaper inference. Either way, the timing is extraordinary — enterprise AI bills have risen 320% since 2024 despite a 98% drop in per-token prices. Something has to give.
What Engram Actually Does
The core problem Engram targets is deceptively simple: every time an AI model processes a request inside an enterprise, it rediscovers the organization from scratch. It rereads the same documents, relearns the same context, and rebuilds the same institutional knowledge — and every token of that repeated work costs money.
"Whatever the AI knows about you is improvised on the spot — a sticky note about your past, a document pulled mid-conversation," said Dan Biderman, Engram's CEO and co-founder. "If we can anticipate your interactions, we can prepare memories ahead of time instead of pasting them on the fly."
Engram's approach separates the reasoning layer from the memory layer. Rather than stuffing more context into each prompt (which inflates token consumption), Engram's models study an organization's documents, workflows, and institutional knowledge in advance, compressing what they learn into a compact, continuously updating memory unique to each customer. The company calls this an "engram" — borrowing the neuroscience term for the trace a memory leaves in the brain.
The technical breakthrough centers on a problem of scale that CTO Sabri Eyuboglu quantified bluntly: "When an AI reads a 70,000-word legal contract — roughly 400 kilobytes of text — its internal memory of that document can swell past 100 gigabytes. That's 250,000 times larger than the original file, and a huge part of what makes AI slow and expensive to run. We do that studying once, ahead of time, training the model to compress everything it learns into a compact memory it can reuse on every query."
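The 250,000x figure in the quote follows directly from the two sizes it gives. As a rough illustration of why the in-model footprint grows so large, the sketch below also estimates a transformer key-value cache, assuming that is what "internal memory" refers to (the source does not say). The model dimensions and token count are hypothetical values chosen for illustration, not Engram's or any specific model's.

```python
def blowup_ratio(memory_bytes: float, text_bytes: float) -> float:
    """How many times larger the in-model memory is than the raw text."""
    return memory_bytes / text_bytes


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Common KV-cache estimate: one key and one value vector
    per token, per layer, per KV head."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Figures from the quote: 400 KB of text, 100 GB of internal memory.
ratio = blowup_ratio(100e9, 400e3)

# Hypothetical model shape and token count (assumptions, not from the source).
estimate = kv_cache_bytes(tokens=91_000, layers=80, kv_heads=64, head_dim=128)

print(f"quoted blow-up: {ratio:,.0f}x")
print(f"illustrative KV cache: {estimate / 1e9:.1f} GB")
```

Under these assumed dimensions the cache lands in the hundreds of gigabytes, the same order of magnitude as the quoted figure; different architectures would give different totals.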
The founding team brings deep academic credentials to this problem. Biderman completed postdoctoral work at Stanford under Chris Ré, one of the most influential figures in modern machine learning and a co-founder of Engram. Eyuboglu created Cartridges, a method for turning large document collections into small, reusable memories. Co-founder Jessy Lin developed Active Reading, a technique that trains models to study material deeply rather than merely store it. Scott Linderman, a tenured Stanford professor, is a leading researcher on state space models, a fast-growing alternative to the transformer architecture that is designed to handle long sequences of information efficiently.
Early enterprise partners are already testing the technology on production workloads.
"Our enterprise customers are running long-lived agents across their Notion workspaces, and that kind of always-on work can burn through tokens fast," said Simon Last, co-founder of Notion. "We're testing Engram's models inside our new custom agents, and we're already seeing them approach frontier quality while using an order of magnitude fewer tokens."
Harvey, the legal AI startup, sees a similar opportunity. "Law firms and enterprises hold a lot of unique knowledge," said co-founder Gabe Pereyra. "Soon every employee will rely on agents that are adding millions of tokens per day of new context — faster than context windows or search can keep up."
And the anchor partnership with Microsoft goes deepest: the two companies are testing Engram's models within Microsoft 365, with committed GPU capacity across Dapple and Azure for training at scale.
Why This Matters Now
The Token Cost Crisis Is Real — and Getting Worse
The market Engram is entering isn't hypothetical. Enterprise AI spending has become one of the most urgent problems in corporate technology.
Per-token prices have fallen 98% since 2022 — GPT-4-equivalent performance now costs roughly $0.40 per million tokens, down from $20 per million. Yet the average enterprise AI budget has grown from $1.2 million per year in 2024 to $7 million in 2026, a 320% increase.
The culprit is volume. Agentic AI tools have pushed the cost of a single interaction from $0.04 in 2023 to roughly $1.20 in 2026, a 30x increase, because each task now consumes far more tokens. Per-developer token consumption has risen 18.6x in nine months, according to Jellyfish research. Goldman Sachs projects global token usage will multiply 24x by 2030.
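The price and cost figures above can be reconciled with a few lines of arithmetic. This sketch uses only the per-token prices and per-interaction costs quoted in the text. The implied growth in tokens per interaction is a derived estimate that assumes the price drop and the cost increase cover the same period, which the source does not state.

```python
def fractional_drop(old: float, new: float) -> float:
    """Fraction by which a value fell, e.g. 0.98 for a 98% drop."""
    return (old - new) / old


def multiple(old: float, new: float) -> float:
    """How many times larger (or smaller) the new value is."""
    return new / old


# Per-token price in dollars per million tokens, as quoted.
price_drop = fractional_drop(20.00, 0.40)

# Cost per interaction in dollars, as quoted.
cost_multiple = multiple(0.04, 1.20)

# cost = tokens * price, so the token multiple is the cost multiple
# divided by the price multiple. This is an assumption-laden estimate.
implied_token_multiple = cost_multiple / multiple(20.00, 0.40)

print(f"price drop: {price_drop:.0%}")
print(f"cost per interaction: {cost_multiple:.0f}x")
print(f"implied tokens per interaction: {implied_token_multiple:.0f}x")
```

Read this way, a 30x rise in cost per interaction alongside a 50x fall in price implies that tokens per interaction grew by roughly three orders of magnitude.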
The data from Ramp's AI Index paints the picture in dollars: the median company spends $2,246 per month on AI tokens, but the average is $140,842 — pulled up by heavy spenders in the 99th percentile hitting $831,338 monthly. Token usage across businesses grew 1,001% from January 2025 to April 2026.
The word "tokens" appeared in 129 earnings calls in Q2 2026, up from 57 the prior quarter.
Even tech giants are struggling. Meta is imposing centralized spending controls on employee AI usage after the cost of internal token consumption approached billions of dollars. The FinOps Foundation reports companies running 3x over their entire 2026 token allocations by April.
For CTOs: An Architectural Question
The deeper question Engram raises is architectural. Today's enterprise AI stack treats every interaction as stateless — each query rebuilds context from scratch through retrieval-augmented generation (RAG), context stuffing, or prompt engineering. This worked when AI was a productivity overlay used by a handful of power users. It breaks when hundreds of agents run continuously across an organization, each one re-processing the same institutional knowledge on every request.
Engram's "learned memory" approach represents a third option beyond RAG and fine-tuning: train a model to internalize organizational knowledge into compressed, reusable state that persists across sessions and queries. If this works at scale, it changes the calculus on which workloads can economically run on AI.
For CFOs: A Cost Structure Shift
The financial implications cut both ways. If Engram delivers its claimed 90-99% token reduction, it could transform AI from a variable cost that scales linearly with usage into something closer to a fixed cost that amortizes over time. The more an organization uses Engram, the more specialized its memory becomes — and the cheaper each subsequent interaction gets.
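The shift from variable to amortized cost can be made concrete with a break-even calculation. The sketch below is a simplified model, not Engram's pricing: the upfront cost, token counts, and per-token price are all hypothetical, and only the 90% reduction comes from the claimed range.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    price_per_million: float   # dollars per million tokens (hypothetical)
    tokens_per_query: int      # tokens a stateless query consumes (hypothetical)
    reduction: float           # claimed token reduction, e.g. 0.90
    upfront_cost: float        # hypothetical one-time cost of building the memory

    def stateless_cost(self, queries: float) -> float:
        """Pure variable cost: every query pays for its full context."""
        return queries * self.tokens_per_query * self.price_per_million / 1e6

    def memory_cost(self, queries: float) -> float:
        """Upfront cost plus a smaller variable cost per query."""
        reduced = self.tokens_per_query * (1 - self.reduction)
        return self.upfront_cost + queries * reduced * self.price_per_million / 1e6

    def break_even_queries(self) -> float:
        """Query count at which the two approaches cost the same."""
        saving = self.tokens_per_query * self.reduction * self.price_per_million / 1e6
        return self.upfront_cost / saving


model = CostModel(price_per_million=10.0, tokens_per_query=20_000,
                  reduction=0.90, upfront_cost=50_000.0)
print(f"break-even at about {model.break_even_queries():,.0f} queries")
```

Below the break-even volume the stateless approach is cheaper; above it, each additional query widens the gap in favor of the memory layer. A real evaluation would also need to account for any recurring fees, which this sketch omits.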
But it also introduces a new dependency. Companies would be trading one vendor relationship (with model providers like OpenAI or Anthropic) for another, with Engram sitting between their data and their AI inference. The question of who owns the "memory" — and what happens if you want to switch providers — is one enterprise procurement teams should ask early.
Market Context: Memory Is the New Battleground
Engram isn't the only company pursuing AI memory. The space has rapidly grown to include multiple approaches:
- Mem0 raised funding and counts 48,000+ GitHub stars, offering structured memory with entity relationships and multi-hop queries (Pro tier at $249/month).
- Zep uses temporal context graphs that track how facts change over time.
- Letta (formerly MemGPT) provides an OS-inspired tiered memory system within an agent runtime.
- Cognee combines graphs and vectors with data ingestion from 30+ sources.
But Engram's approach differs fundamentally from all of these. While existing memory systems store and retrieve — essentially sophisticated databases that sit alongside AI models — Engram trains models to learn organizational context and compress it into reusable internal state. The distinction is between a filing cabinet (storage) and an employee who's studied the filing cabinet and can answer questions from memory (learned knowledge).
"Most of the conversation around enterprise AI has focused on making models generally smarter," said Leigh Marie Braswell, partner at Kleiner Perkins. "Getting a model to truly remember a specific organization and its unique ways of working is the problem nobody had convincingly solved."
The competitive landscape also includes broader cost-optimization approaches. Model routing — automatically picking the cheapest adequate model for each task — is emerging as the primary cost lever for enterprises. Caching reduces redundant inference. The Linux Foundation is launching the Tokenomics Foundation to create open standards for AI token billing.
But none of these approaches address Engram's core thesis: the most expensive tokens are the ones you shouldn't have to spend in the first place.
Framework 1: Enterprise AI Memory Architecture Decision Matrix
Not every enterprise needs a learned memory layer. The right approach depends on your data complexity, scale, and cost tolerance. Use this matrix to evaluate which memory architecture fits your use case.
| Criteria | Context Stuffing | RAG (Retrieval-Augmented) | Memory Platforms (Mem0/Zep) | Model Fine-Tuning | Learned Memory (Engram) |
|---|---|---|---|---|---|
| Best for | Simple, low-volume tasks | Document search, Q&A | Conversational agents, personalization | Domain-specific accuracy | High-volume, recurring enterprise workflows |
| Token efficiency | Poor (10-50K tokens/query) | Moderate (5-15K tokens/query) | Good (2-5K tokens/query) | Excellent (minimal context needed) | Excellent (1-10% of baseline) |
| Setup complexity | None | Medium (vector DB, chunking, embeddings) | Medium (API integration, scoping) | High (training data, compute, evaluation) | Medium-High (org data mapping, training cycle) |
| Knowledge freshness | Real-time | Near-real-time (re-index on change) | Session-to-session | Stale until retrained | Continuously improving |
| Organizational learning | None — stateless | None — retrieval only | Partial — stores facts | Static — frozen at training time | Deep — compounds over use |
| Cost at 500 agents | $50,000-150,000/mo | $20,000-60,000/mo | $10,000-30,000/mo | $5,000-15,000/mo + training | $2,000-15,000/mo (claimed) |
| Data sovereignty | Provider-dependent | Self-hostable | Varies by vendor | You own the weights | You own the memory |
| Vendor lock-in risk | Low | Low-Medium | Medium | High (model-specific) | Medium-High (Engram-specific) |
When to choose each:
- Context stuffing: Prototyping, low-volume internal tools, <10 users
- RAG: Document-heavy workflows (legal search, support KB), where content changes frequently
- Memory platforms: Multi-session chatbots, customer-facing agents that need personalization
- Fine-tuning: Narrow, high-stakes domains (medical, financial) where accuracy > cost
- Learned memory: High-volume enterprise deployment (100+ agents), recurring workflows, cost is the primary bottleneck
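The selection guidance above can be written down as a small rule function. This is one possible encoding: the thresholds (100+ agents, fewer than 10 users) come from the list, while the order in which the rules are checked, the parameter names, and the fallback to RAG are assumptions made for illustration.

```python
def suggest_architecture(users: int, agents: int,
                         content_changes_often: bool = False,
                         needs_personalization: bool = False,
                         accuracy_over_cost: bool = False) -> str:
    """Map a rough deployment profile to one of the five approaches."""
    if agents >= 100:
        # High-volume deployments where cost is the main bottleneck.
        return "learned memory"
    if accuracy_over_cost:
        # Narrow, high-stakes domains.
        return "fine-tuning"
    if needs_personalization:
        # Multi-session or customer-facing agents.
        return "memory platform"
    if content_changes_often:
        # Document-heavy workflows with frequent updates.
        return "RAG"
    if users < 10:
        # Prototypes and small internal tools.
        return "context stuffing"
    return "RAG"  # assumed default when no rule above applies


print(suggest_architecture(users=5, agents=1))
print(suggest_architecture(users=2_000, agents=500))
```

A real decision would weigh several criteria from the matrix at once; a first-match rule list like this is only a starting point for discussion.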
Framework 2: Enterprise AI Token Cost Reduction Roadmap
Reducing enterprise AI costs isn't a single decision — it's a staged process. Here's a 12-week roadmap from quick wins to architectural transformation.
Phase 1: Visibility (Weeks 1-3) — Know What You're Spending
Before optimizing, you need data. Most enterprises can't answer basic questions about their AI spend.
- Week 1: Deploy token-level observability (Datadog, New Relic, or Ramp's AI spend management). Map every AI-consuming workflow to a cost center.
- Week 2: Establish baselines — tokens per query, cost per workflow, cost per employee. Identify the top 10 workflows by total token spend.
- Week 3: Categorize spend by necessity: (a) high-value, irreducible; (b) high-volume, optimizable; (c) low-value, eliminable.
Success criteria: CFO can answer "how much are we spending on AI, where, and is it growing?"
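The Week 2 baselines can be computed from whatever per-call usage records the observability tool exports. The sketch below assumes a hypothetical record format of (workflow, tokens, cost in dollars) tuples; real exports will differ, and the sample data is invented for illustration.

```python
from collections import defaultdict


def workflow_baselines(records, top_n=10):
    """Aggregate per-call usage records into per-workflow baselines.

    `records` is an iterable of (workflow, tokens, cost_usd) tuples,
    a hypothetical export format.
    """
    totals = defaultdict(lambda: {"queries": 0, "tokens": 0, "cost_usd": 0.0})
    for workflow, tokens, cost_usd in records:
        entry = totals[workflow]
        entry["queries"] += 1
        entry["tokens"] += tokens
        entry["cost_usd"] += cost_usd

    rows = [
        {
            "workflow": name,
            "queries": t["queries"],
            "tokens": t["tokens"],
            "cost_usd": t["cost_usd"],
            "tokens_per_query": t["tokens"] / t["queries"],
        }
        for name, t in totals.items()
    ]
    # Rank by total token spend, highest first, and keep the top N.
    rows.sort(key=lambda row: row["tokens"], reverse=True)
    return rows[:top_n]


# Hypothetical sample data for illustration only.
sample = [
    ("contract-review", 40_000, 0.60),
    ("support-triage", 5_000, 0.02),
    ("contract-review", 60_000, 0.90),
]
top = workflow_baselines(sample, top_n=2)
for row in top:
    print(row["workflow"], row["tokens"], row["tokens_per_query"])
```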
Phase 2: Quick Wins (Weeks 4-6) — Cut Waste Without Changing Architecture
These interventions typically deliver 30-50% cost reduction with minimal engineering effort.
- Implement caching: Ramp data shows caching can reduce effective per-token rates significantly below list prices. Identify repeated queries and cache responses.
- Enable model routing: Route simple tasks (classification, extraction, summarization) to lightweight models (GPT-5-nano at $0.07/M tokens vs. GPT-5.5 at $1.42/M). This one change can cut costs 10-20x on eligible tasks.
- Set per-user and per-workflow caps: Pylon, Coinbase, and Walmart have all implemented token budgets. Start with the top 10% of spenders.
- Eliminate redundant processing: Audit agentic workflows for unnecessary re-reads. A single agent loop that re-fetches the same document 5 times per run can cost 5x what it should.
Success criteria: 30-50% reduction in total token spend, zero impact on output quality.
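Two of the quick wins above, model routing and response caching, fit in a few lines. This is a minimal sketch: the prices and the list of simple task types are taken from the text, while `call_model` is a hypothetical stand-in for a real API client, and the cache matches only on exact repeats of the same prompt.

```python
# Prices in dollars per million tokens, as quoted in the text.
PRICE_PER_MILLION = {"GPT-5-nano": 0.07, "GPT-5.5": 1.42}
SIMPLE_TASKS = {"classification", "extraction", "summarization"}


def route(task_type: str) -> str:
    """Send simple task types to the lightweight model, all others to the larger one."""
    return "GPT-5-nano" if task_type in SIMPLE_TASKS else "GPT-5.5"


def estimate_cost(task_type: str, tokens: int) -> float:
    """Estimated dollar cost of a task after routing."""
    return tokens * PRICE_PER_MILLION[route(task_type)] / 1e6


class CachedRouter:
    """Wraps a model-calling function with routing and an exact-match cache."""

    def __init__(self, call_model):
        # Hypothetical callable with signature (model_name, prompt) -> str.
        self._call_model = call_model
        self._cache = {}

    def ask(self, task_type: str, prompt: str) -> str:
        key = (task_type, prompt)
        if key not in self._cache:
            # Only pay for inference on a cache miss.
            self._cache[key] = self._call_model(route(task_type), prompt)
        return self._cache[key]


ratio = PRICE_PER_MILLION["GPT-5.5"] / PRICE_PER_MILLION["GPT-5-nano"]
print(f"price ratio between the two models: {ratio:.1f}x")
```

A production cache would also need expiry and invalidation when the underlying documents change, which this sketch leaves out.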
Phase 3: Architectural Optimization (Weeks 7-10) — Change How AI Remembers
This is where memory-layer solutions like Engram, Mem0, or Zep enter the picture.
- Week 7-8: Identify candidate workflows — high-volume, recurring, context-heavy tasks where the same organizational knowledge is reprocessed repeatedly. Legal document review, customer support escalation, and internal knowledge queries are prime targets.
- Week 9: Run a controlled pilot — deploy memory-layer technology on 2-3 workflows and measure: tokens per query (before vs. after), response quality (blind evaluation), latency impact, and total cost.
- Week 10: Evaluate pilot results against the decision matrix above. Calculate projected ROI at full deployment scale.
Success criteria: Pilot shows >50% token reduction with equivalent or better response quality on target workflows.
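The pilot's pass/fail rule can be checked mechanically once the measurements are in. The sketch below assumes per-query token counts and blind-evaluation quality scores collected before and after the change; the function names and sample numbers are hypothetical, and comparing simple means is a simplification of a proper statistical test.

```python
from statistics import mean


def token_reduction(before_tokens, after_tokens) -> float:
    """Fractional drop in mean tokens per query, e.g. 0.7 for a 70% cut."""
    return 1 - mean(after_tokens) / mean(before_tokens)


def pilot_passes(before_tokens, after_tokens,
                 before_quality, after_quality,
                 min_reduction: float = 0.5) -> bool:
    """Apply the success criteria: more than 50% fewer tokens,
    with mean quality at least as good as the baseline."""
    enough_savings = token_reduction(before_tokens, after_tokens) > min_reduction
    quality_held = mean(after_quality) >= mean(before_quality)
    return enough_savings and quality_held


# Hypothetical pilot measurements for illustration only.
before_tokens = [10_000, 12_000, 8_000]
after_tokens = [3_000, 4_000, 2_000]
before_quality = [4, 4, 5]   # blind-evaluation scores on an assumed 1-5 scale
after_quality = [4, 5, 5]

print(f"reduction: {token_reduction(before_tokens, after_tokens):.0%}")
print("pilot passes:", pilot_passes(before_tokens, after_tokens,
                                    before_quality, after_quality))
```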
Phase 4: Scale and Govern (Weeks 11-12) — Build the Operating Model
- Establish AI FinOps function: Assign ownership of AI cost management (98% of FinOps teams now manage AI spend, up from 31% in 2025).
- Create an AI cost allocation model: Charge AI costs back to business units based on usage, creating natural incentives for efficiency.
- Set up continuous optimization: Monthly reviews of token spend vs. business value, with automatic alerts when workflows exceed cost thresholds.
- Plan for agentic scale: With Goldman Sachs projecting 24x token growth by 2030, ensure your architecture can handle 10x current volume without 10x cost.
Success criteria: AI cost growth rate is lower than AI value creation rate, with clear attribution and accountability.
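The governance steps above reduce to three small checks: threshold alerts, proportional chargeback, and a test of whether cost grows more slowly than volume. The sketch below is illustrative only; the function names, data shapes, and the choice to allocate by token share are assumptions.

```python
def workflows_over_threshold(monthly_spend, thresholds):
    """Return the workflows whose spend exceeds their configured threshold."""
    return sorted(
        name for name, spend in monthly_spend.items()
        if name in thresholds and spend > thresholds[name]
    )


def chargeback(total_cost, tokens_by_unit):
    """Allocate a shared bill to business units in proportion to tokens used."""
    total_tokens = sum(tokens_by_unit.values())
    return {unit: total_cost * tokens / total_tokens
            for unit, tokens in tokens_by_unit.items()}


def scales_sublinearly(current_cost, projected_cost, volume_multiple=10):
    """True if cost at the higher volume grows by less than the volume multiple."""
    return projected_cost < volume_multiple * current_cost


# Hypothetical figures for illustration only.
alerts = workflows_over_threshold(
    {"contract-review": 12_000.0, "support-triage": 900.0},
    {"contract-review": 10_000.0, "support-triage": 1_000.0},
)
allocation = chargeback(1_000.0, {"legal": 600, "support": 400})
print(alerts, allocation, scales_sublinearly(5_000.0, 20_000.0))
```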
Case Study: The $500 Million Claude Bill and What It Teaches
The most dramatic illustration of the problem Engram targets came in May 2026, when reports emerged of a company that ran up a $500 million Anthropic bill in a single month after failing to set usage limits.
While the specifics are unverified, the underlying math is worth working through. An enterprise running 1,000 agentic workflows, each processing an average of 50,000 tokens per interaction and running 100 interactions per day, consumes about 150 billion tokens in a 30-day month. At Claude Opus rates of approximately $15 per million input tokens and $75 per million output tokens, that is a monthly bill of roughly $2.25 million to $11.25 million, depending on the input/output mix. Reaching $500 million would require roughly 45 to 220 times that volume. With uncapped developer experimentation, no model routing, and no caching, nothing in the system stops usage from growing that far.
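The arithmetic in this scenario can be checked directly. The sketch below uses the workload and prices stated above and assumes a 30-day month. Because the source gives no input/output split, it computes the all-input and all-output cases as lower and upper bounds.

```python
def monthly_tokens(workflows: int, interactions_per_day: int,
                   tokens_per_interaction: int, days: int = 30) -> int:
    """Total tokens consumed in a month by the described workload."""
    return workflows * interactions_per_day * tokens_per_interaction * days


def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of a token volume at a given per-million-token price."""
    return tokens / 1e6 * price_per_million


tokens = monthly_tokens(workflows=1_000, interactions_per_day=100,
                        tokens_per_interaction=50_000)
low = monthly_cost(tokens, 15.0)    # every token billed at the input rate
high = monthly_cost(tokens, 75.0)   # every token billed at the output rate

print(f"tokens per month: {tokens:,}")
print(f"bill range: ${low:,.0f} to ${high:,.0f}")
print(f"volume multiple needed for $500M: {500e6 / high:.0f}x to {500e6 / low:.0f}x")
```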
The lesson isn't that AI is too expensive. It's that enterprise AI without memory, without cost governance, and without architectural discipline is too expensive. Every token spent re-reading a document the model already processed yesterday is a token wasted. Every agentic loop that rebuilds context from scratch instead of drawing on learned organizational knowledge is money burned.
This is the market opportunity Engram is targeting: not making AI cheaper per token (the model providers are doing that), but making AI smarter about which tokens it needs in the first place.
What to Do About It
For CIOs and CTOs
- Audit your token architecture now. If you can't quantify how much context your AI systems rebuild from scratch on every query, you're flying blind. Start with the top 10 workflows by token spend.
- Evaluate memory-layer solutions. Whether it's Engram, Mem0, Zep, or a custom implementation, the question isn't whether you need persistent AI memory — it's which approach fits your data complexity and scale.
- Demand benchmarks. Engram's 100x efficiency claim is extraordinary. Before committing, require controlled A/B tests on your actual workflows with your actual data. Measure both cost and quality.
For CFOs and Finance Leaders
- Model AI costs as a portfolio, not a line item. Different workflows have different cost profiles. A $7M annual AI budget should be broken into 20+ cost centers, each optimizable independently.
- Watch for the memory lock-in question. If Engram builds a proprietary memory of your organization, switching costs become significant. Negotiate data portability terms early.
- Benchmark against Ramp's data. If your PEPM (per-employee-per-month) AI spend exceeds $130, you're in the top quartile. That's not inherently bad — but you should know why.
For Business Leaders
- Don't wait for the technology to mature. The 12-week roadmap above starts with visibility and quick wins that require zero new vendor relationships. Phases 1-2 alone can cut AI costs 30-50%.
- Think about AI memory as a competitive asset. Engram's "sovereign AI" thesis — that the more you use it, the more proprietary your AI becomes — has strategic implications. Companies that invest in organizational AI memory early could build moats their competitors can't easily replicate.
- Budget for 10x. Goldman Sachs projects 24x token usage growth by 2030. If your AI architecture can't handle 10x current volume at less than 10x current cost, you have an architecture problem, not a budget problem.
