
Context Engineering: What Comes After Prompt Engineering

Artinoid Team·March 25, 2026·11 min read

The prompt that worked in the demo breaks in production

You've seen it. The demo is flawless — the model follows instructions, retrieves the right information, responds exactly as intended. Everyone in the room is impressed. Three weeks later, the same system is in staging and it's hallucinating policy details, losing track of conversation history, and confidently answering questions it should be deferring.

Nothing changed in the prompt. So what broke?

The prompt was never the whole picture. In a controlled demo with a short input and a single turn, a well-written prompt carries you a long way. But the moment you add memory, tool calls, retrieved documents, and multi-turn conversation, the prompt becomes a small fraction of what the model actually sees. What determines quality at that point isn't how cleverly you phrased the instruction. It's everything else that surrounds it.

That "everything else" has a name now: context engineering.


What is context engineering — and why now?

Context engineering is the discipline of deciding what information goes into an LLM's context window, in what form, in what order, and at what point in a conversation or workflow.

The term gained mainstream traction in June 2025, when Shopify CEO Tobi Lütke described it as "the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy — formerly of OpenAI and Tesla — amplified the idea, calling it "the delicate art and science of filling the context window with just the right information for the next step." His framing stuck because it put a precise name on something practitioners had been doing for years without a shared vocabulary.

The analogy Karpathy offers is useful: think of the LLM as a CPU and its context window as RAM. Your job, as the person building on top of it, is to act like an operating system — loading exactly the right data into working memory for whatever task the model is about to execute. Load too little and it doesn't have what it needs. Load too much irrelevant information and performance degrades while costs climb.

This isn't a rebrand of prompt engineering. It's a different layer of the problem entirely.


Prompt engineering isn't dead — it just got a smaller job

Prompt engineering still matters. For self-contained, single-turn tasks — summarisation, classification, one-shot code generation — a well-constructed prompt is often all you need. Getting your instructions clear, your output format specified, and your few-shot examples right will take you far in those scenarios.

But most production AI systems aren't self-contained. They have memory across sessions. They call external tools. They retrieve documents from vector stores. They run as agents that make multiple LLM calls in sequence, each one depending on the output of the last.

In those systems, the prompt is maybe 10% of what the model actually sees at inference time. The other 90% is assembled dynamically at runtime — retrieved chunks from Pinecone or Weaviate, conversation history compressed by a summarisation step, tool call results piped back in, system instructions layered on top. The model doesn't just receive your prompt. It receives your prompt inside a carefully (or carelessly) constructed information environment.

That environment is what context engineering controls.

A brilliant prompt surrounded by a poorly assembled context will underperform. A simple prompt embedded in a clean, well-structured context will often outperform it. The teams that have figured this out are the ones shipping AI systems that actually hold up in production.


The four pillars of context engineering

Context engineering isn't one technique — it's a set of decisions that together determine what the model sees. These decisions break down into four areas.

Memory management. LLMs have no persistent memory by default. Every call starts fresh. Managing memory means deciding what to store, what to summarise, and what to discard between turns. In practice, this often means a tiered approach: recent turns stay verbatim in the context, older turns get compressed by a summarisation step, and long-term facts (user preferences, prior decisions) live in a separate store and get retrieved selectively. Tools like LangChain and LlamaIndex have built-in memory primitives, but how you configure them — what gets saved, what gets dropped, what triggers retrieval — is a design decision that directly affects output quality.
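A minimal sketch of that tiered approach in plain Python. The class name, tier sizes, and the `summarise` stub (which stands in for an actual LLM summarisation call) are all illustrative, not any framework's API:

```python
from collections import deque

class TieredMemory:
    """Three tiers: recent turns verbatim, older turns folded into a
    rolling summary, long-term facts in a separate key-value store."""

    def __init__(self, max_recent=4):
        self.recent = deque(maxlen=max_recent)  # working context, verbatim
        self.summary = ""                       # episodic memory, compressed
        self.long_term = {}                     # persistent facts

    def summarise(self, turn):
        # Stub: a real system would make an LLM call here that compresses
        # the turn while preserving key decisions and facts.
        return turn[:40]

    def add_turn(self, turn):
        # When the recent tier is full, compress the oldest turn into the
        # summary before the deque evicts it.
        if len(self.recent) == self.recent.maxlen:
            evicted = self.recent[0]
            self.summary = (self.summary + " | " + self.summarise(evicted)).strip(" |")
        self.recent.append(turn)

    def remember(self, key, value):
        self.long_term[key] = value

    def build_context(self, relevant_keys=()):
        # Pull from long-term storage selectively, not wholesale.
        facts = [f"{k}: {self.long_term[k]}" for k in relevant_keys if k in self.long_term]
        parts = []
        if facts:
            parts.append("Known facts:\n" + "\n".join(facts))
        if self.summary:
            parts.append("Earlier in this session: " + self.summary)
        parts.append("Recent turns:\n" + "\n".join(self.recent))
        return "\n\n".join(parts)
```

The design decision the article describes lives in `build_context`: only the long-term facts you explicitly ask for enter the window, so the context stays lean as the session grows.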

Retrieval-Augmented Generation (RAG). This is the most talked-about pillar, and the most commonly misunderstood. RAG isn't just "search before you generate." Done well, it's a precision retrieval problem: which chunks of which documents, at what granularity, ranked by what relevance signal, get injected into the context for this specific query. Get the chunking wrong and you retrieve fragments that miss critical surrounding context. Get the retrieval ranking wrong and you surface plausible but irrelevant documents. The model doesn't know the difference — it works with whatever it's given. (If you're weighing RAG against fine-tuning for your use case, our breakdown covers the tradeoffs in detail.)
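The chunking and ranking decisions above can be sketched in a few lines. This toy version chunks by characters and scores by word overlap; a production system would chunk by tokens or sentences, embed with a real model, and rerank with a cross-encoder — the shape of the decisions is the point:

```python
def chunk(text, size=200, overlap=50):
    """Overlapping chunks, so meaning that straddles a boundary (like a
    clause and its qualifier) isn't split across two fragments."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def score(query, chunk_text):
    """Toy lexical relevance signal: fraction of query words present.
    Stands in for embedding similarity plus reranking."""
    q = set(query.lower().split())
    c = set(chunk_text.lower().split())
    return len(q & c) / (len(q) or 1)

def retrieve(query, chunks, top_k=3):
    """Rank all chunks by relevance and inject only the top_k."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]
```

Every parameter here — `size`, `overlap`, `top_k`, the scoring function — is one of the retrieval decisions the paragraph describes, and each one changes what the model ends up seeing.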

Tool definitions and state. When an agent has access to tools — search APIs, calculators, CRM lookups, code executors — those tool definitions take up context space. More importantly, the results they return get fed back into the context, where they compete for attention with everything else. Poorly structured tool outputs, or too many active tools at once, are a common source of degraded agent performance that gets mistakenly blamed on the underlying model.
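One contained way to keep tool outputs from swamping the context is to compact each result before it re-enters the window. A hedged sketch — the field names and size cap are illustrative, and `compact_tool_result` is not part of any framework:

```python
import json

def compact_tool_result(raw_result: dict, keep_fields, max_chars=500):
    """Reduce a verbose tool response to only the fields the next step
    needs, then cap its length, before it competes for attention in
    the context window."""
    slim = {k: raw_result[k] for k in keep_fields if k in raw_result}
    text = json.dumps(slim, ensure_ascii=False)
    return text[:max_chars]
```

A CRM lookup that returns forty fields and a debug trace becomes a short, structured snippet containing only what the agent's next step will actually use.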

Structured system instructions. The system prompt is the layer most teams spend the most time on, and often the one that's least systematically managed. In production systems handling diverse requests, a static system prompt becomes a liability. Dynamic system instructions — assembled from modular components based on the task, user role, or conversation state — give you far more control. They also help you keep the total context lean by only including constraints and personas relevant to the current interaction.
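Composing system instructions from modules might look like the sketch below. The persona, module names, and routing rules are hypothetical — the pattern is what matters: only constraints relevant to this request enter the context:

```python
BASE = "You are a support assistant for Acme Corp."  # hypothetical persona

MODULES = {
    "billing": "When discussing billing, cite the exact invoice line item.",
    "admin": "The user is an administrator; account-level actions are allowed.",
    "escalation": "If confidence is low, offer to hand off to a human agent.",
}

def build_system_prompt(task: str, user_role: str, low_confidence: bool) -> str:
    """Assemble the system prompt from modular components based on task,
    user role, and conversation state, instead of shipping one
    monolithic prompt that covers every scenario."""
    parts = [BASE]
    if task == "billing":
        parts.append(MODULES["billing"])
    if user_role == "admin":
        parts.append(MODULES["admin"])
    if low_confidence:
        parts.append(MODULES["escalation"])
    return "\n\n".join(parts)
```

Because each module is a separate artifact, it can be versioned and tested in isolation — which pays off again later when you treat the system prompt as managed infrastructure.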

These four pillars together determine the information environment your model operates in. Optimise them well and you're doing context engineering. Leave them to chance and you're hoping your prompt is good enough to compensate.


Context rot: the silent killer of AI reliability

Here's a failure mode that rarely gets named but shows up constantly in production AI systems: context rot.

It happens when the context window fills up with information that was once relevant but no longer is — or was never relevant to begin with. As the context grows longer, the model's attention gets diluted across more content. Signal-to-noise ratio drops. The model starts weighting older or less relevant information alongside critical facts, and output quality degrades in ways that feel random but aren't.

Context rot is particularly brutal in agentic systems with long-running workflows. An agent that has made fifteen tool calls over a complex task has fifteen sets of results in its context, including the ones from step two that are now completely irrelevant to step fifteen. The model sees all of it. In research settings, this phenomenon — sometimes called "lost in the middle" — has been shown to cause models to systematically underweight information that appears in the middle of a long context, even when that information is critical to the task.

The fix isn't a larger context window. A 200K token context with 150K tokens of noise is worse than a 20K context that's tightly curated.

Fighting context rot comes down to three practices.

Active pruning. Regularly remove tool results, prior turns, and retrieved chunks that are no longer relevant to the current task step.

Progressive summarisation. Instead of keeping full conversation history verbatim, periodically compress older turns into a dense summary that preserves key decisions and facts without the verbatim detail.

Tiered memory architecture. Separate working context (what the model sees right now) from episodic memory (what happened earlier in this session) and long-term storage (persistent facts about users or systems), and pull from each tier selectively based on what the current step actually needs.
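Active pruning for the fifteen-tool-call agent above can be as simple as collapsing stale tool results into one-line stubs. A sketch, assuming a message shape of `{"role", "content", "step"}` (not any particular framework's format):

```python
def prune_tool_results(entries, keep_last=3):
    """Non-tool entries pass through untouched; only the most recent
    tool results stay verbatim, and older ones collapse into stubs so
    they stop diluting the model's attention."""
    tool_entries = [e for e in entries if e["role"] == "tool"]
    keep = {id(e) for e in tool_entries[-keep_last:]}
    pruned = []
    for e in entries:
        if e["role"] == "tool" and id(e) not in keep:
            pruned.append({"role": "tool", "step": e["step"],
                           "content": f"[tool result from step {e['step']} pruned]"})
        else:
            pruned.append(e)
    return pruned
```

The stub preserves the fact that a step happened (so the agent's trajectory stays legible) while discarding the bulk of the payload — the step-two results no longer compete with step fifteen for attention.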

Context rot is invisible until it costs you. Getting ahead of it is one of the highest-leverage things you can do to improve the reliability of any production AI system.


What this looks like in practice

Consider an AI system designed to help users understand complex documents — insurance policies, legal contracts, technical specifications. On the surface, it sounds like a straightforward RAG application: user asks a question, system retrieves relevant chunks, model generates an answer.

In reality, the context engineering challenges are significant. The user might ask a question that requires cross-referencing three separate sections of a 40-page document. Retrieved chunks that are too small lose the surrounding clause structure that changes the meaning entirely. Retrieved chunks that are too large flood the context with boilerplate text that buries the signal. The system also needs to maintain conversation state — if the user asked about claim limits two turns ago and is now asking a follow-up, the model needs that prior exchange to answer accurately. And it needs to express uncertainty appropriately when retrieved content genuinely doesn't contain the answer, rather than confabulating.

Getting this right requires decisions at every layer: chunk size and overlap strategy, embedding model selection, reranking at retrieval, conversation memory structure, and how tool outputs feed back into the next turn. The prompt is almost the last thing you tune.

When we built CoverWise — an AI document intelligence system for insurance policy queries — the majority of engineering effort went into exactly these layers. The result was a system capable of returning cited, accurate answers in under five seconds on complex multi-clause queries. The prompt mattered, but it was the context architecture that made it reliable. (For a broader look at what agentic AI systems demand from an engineering perspective, this piece on agentic AI for enterprises covers the production realities in depth.)


How to start: context engineering in practice

If you're building or maintaining an LLM-based system, context engineering probably isn't something you bolt on later. It's most effective when it shapes your architecture from the start. But even in existing systems, there are concrete steps you can take.

Audit what's actually in your context window. Most teams are surprised when they log and inspect the full context at inference time. You'll typically find redundant instructions, irrelevant retrieved chunks, and conversation history that's grown longer than anyone realised. Start with visibility — you can't optimise what you can't see.

Move from static to dynamic retrieval. If your RAG setup retrieves the same fixed number of chunks regardless of query type, you're leaving quality on the table. Queries that require precise factual lookup need different retrieval behaviour than queries that need broad thematic coverage. Implementing query classification as a first step — and adjusting retrieval strategy accordingly — is a relatively contained change that has an outsized effect.
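A contained version of that query-classification step. The markers and profile numbers are illustrative; in practice the classifier would itself be a small LLM or a trained model, but even a crude heuristic lets you stop retrieving a fixed chunk count for every query:

```python
def classify_query(query: str) -> str:
    """Crude heuristic classifier: does this query look like a precise
    factual lookup or a broad thematic question?"""
    factual_markers = ("what is", "when", "how much", "limit", "number")
    if any(m in query.lower() for m in factual_markers):
        return "factual"
    return "thematic"

RETRIEVAL_PROFILES = {
    # Hypothetical settings: precise lookups want few, tightly ranked
    # chunks; thematic questions want broader coverage.
    "factual": {"top_k": 3, "chunk_size": 256, "rerank": True},
    "thematic": {"top_k": 10, "chunk_size": 512, "rerank": False},
}

def retrieval_params(query: str) -> dict:
    """Choose the retrieval strategy per query instead of using one
    static configuration for everything."""
    return RETRIEVAL_PROFILES[classify_query(query)]
```

Swapping the heuristic for a real classifier later doesn't change the architecture — the routing layer between query and retrieval configuration is the contained change with the outsized effect.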

Implement tiered memory. Not everything in a conversation deserves to stay in the active context. Building even a simple two-tier system — verbatim recent turns plus a compressed summary of older turns — measurably reduces context rot in multi-turn applications.

Treat your system prompt as a managed artifact. Version control it, test changes against a consistent evaluation set, and where possible, make it modular so that you can compose context-appropriate instructions rather than maintaining one monolithic prompt that tries to handle every scenario.

These aren't weekend projects — particularly in systems with complex tool use or high-volume multi-agent workflows. But the teams who get this infrastructure right stop chasing mysterious output quality problems and start shipping AI products that users can actually depend on. If you're working through this architecture and want an experienced team to pressure-test your approach, Artinoid's AI engineering practice works on exactly these problems.


This is infrastructure, not a trick

The teams winning with AI in 2026 aren't writing better prompts. They're building better information systems.

Context engineering is the recognition that an LLM is only as good as what you put in front of it — and that in any non-trivial application, what you put in front of it is a system design problem, not a copywriting problem. The shift from thinking about prompts to thinking about context is the same shift that happened in software engineering when we moved from writing individual functions to thinking about data architecture. The functions still matter. The architecture is what determines whether the system holds together.

If your AI application works in demos but degrades in production, the answer probably isn't a better prompt. It's a better context.


Artinoid builds production-grade AI systems — LLM applications, RAG pipelines, and agentic workflows — for businesses that need AI that actually works under real conditions. Talk to our team about what you're building.