Context Engineering: The next frontier in building reliable AI systems 

As large language models (LLMs) like GPT, Claude, and Gemini become foundational to AI applications, the question is no longer whether we use them, but how best to architect their use. Traditional prompt engineering, crafting clever instructions to coax desired outputs, has taken us far. Yet, as real-world applications grow more complex, this approach reveals fundamental limitations. Enter Context Engineering: a rapidly evolving discipline focused on designing dynamic, structured information environments that empower LLMs to perform consistently, reliably, and at scale. 

In this post, we unpack what context engineering is, why it matters, how it differs from prompt engineering, common pitfalls, and the techniques and tools driving this critical advance. 

What is Context Engineering? 

At its core, context engineering is the art and science of giving an LLM the right information, in the right format, at the right time to accomplish a task. 

(See our previous article: “Software 3.0: How Large Language Models Are Reshaping Programming and Applications” at https://novelis.io/research-lab/software-3-0-how-large-language-models-are-reshaping-programming-and-applications/)

Where prompt engineering is about crafting static or one-off instructions, context engineering embraces the complexity of dynamic systems that manage how information flows into the model’s context window, the LLM’s working memory where it ‘sees’ and reasons about data. 

(Source: https://github.com/humanlayer/12-factor-agents/) 

As an analogy, Andrej Karpathy in his talk (shared above) describes the LLM as a CPU and the context window as its RAM: limited, precious working memory that must be carefully packed to maximize performance. Context engineering is precisely about optimizing this RAM usage to enable sophisticated, multi-step AI applications. 

More formally, context includes everything an AI needs to reason well: notes, reference materials, historical interactions, external tool outputs, and explicit instructions on output format. Humans naturally curate and access such information; for AI, we must explicitly engineer this information environment. 
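
To make this concrete, here is a minimal Python sketch of what engineering that information environment can look like: one structure gathering instructions, references, tool outputs, and history, then packing them into the model’s limited context window. All names are illustrative rather than any specific library’s API.

```python
from dataclasses import dataclass, field

@dataclass
class ContextPackage:
    """Everything the model needs for one reasoning step."""
    instructions: str                                        # task and output-format rules
    reference_docs: list[str] = field(default_factory=list)  # retrieved knowledge
    tool_outputs: list[str] = field(default_factory=list)    # results from external systems
    history: list[str] = field(default_factory=list)         # prior interactions

    def render(self, budget_chars: int = 12_000) -> str:
        """Pack sections into one prompt; only recent history is kept,
        so the most task-relevant material survives the limited 'RAM'."""
        sections = [
            ("INSTRUCTIONS", [self.instructions]),
            ("REFERENCE", self.reference_docs),
            ("TOOL OUTPUTS", self.tool_outputs),
            ("HISTORY", self.history[-10:]),  # keep only the most recent turns
        ]
        parts = [f"## {title}\n{item}" for title, items in sections for item in items]
        return "\n\n".join(parts)[:budget_chars]
```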

Context Engineering vs. Prompt Engineering: Key differences 

While closely related, these disciplines differ fundamentally: 

  • Prompt Engineering is about what to say, crafting instructions or questions to get the model to respond appropriately. It works well for direct, interactive chatbot scenarios. 
  • Context Engineering is about how to manage all the information and instructions the AI will need across complex tasks and varying conditions. It involves orchestrating multiple components in a system, ensuring the AI understands and applies the right knowledge in the right moment. 

Put simply: 
Prompt engineering is like explaining to someone what to do. 
Context engineering is like ensuring they have the right tools, background, and environment to actually do it reliably. 

This shift is critical because AI agents cannot simply “chat until they get it right.” They require comprehensive, self-contained context, encompassing all possible scenarios and necessary resources, encoded and managed dynamically. 

Why Context Engineering matters 

Context engineering is essential for production-grade AI applications for several reasons: 

  • Reliability at Scale: Without careful context management, models forget or hallucinate crucial details mid-interaction. Compliance requirements or customer-specific data can be lost, risking inconsistent or even harmful outputs. 
  • Handling Complexity: Real-world AI apps involve layers of knowledge, tools, and user history. Prompts balloon into tangled instructions that lose coherence and potency. Context engineering provides the instruction manual and architecture for the AI agent to leverage everything effectively. 
  • Overcoming Prompt Limitations: Single-prompt approaches struggle when juggling multiple responsibilities or maintaining session memory. Context engineering enables modular, scalable memory, long-term knowledge, and tool integration. 

In short, for companies building AI-powered products, especially AI agents that must perform multi-step reasoning and interact with external systems, context engineering is make or break. 

Core components of AI Agents 

Context engineering coordinates six fundamental components to build capable AI agents: 

  1. Model: The foundational LLM (e.g., GPT-4, Claude, Gemini). 
  2. Tools: External APIs or systems the agent can query (e.g., calendar, search). 
  3. Knowledge and Memory: Persistent storage of relevant data, session history, and learned patterns. 
  4. Audio and Speech: Interfaces enabling natural interaction. 
  5. Guardrails: Safety and ethical controls. 
  6. Orchestration: The systems managing deployment, monitoring, and ongoing improvement. 

The context engineer’s role is to define how these parts work together through precisely crafted context, detailed prompts and structured data that govern the agent’s behavior. 
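
As a rough illustration, the six components can be pictured as fields of a single configuration object. This is a hypothetical sketch, not any framework’s real API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentConfig:
    model: str                                  # 1. foundational LLM, e.g. "gpt-4"
    tools: dict[str, Callable[..., str]]        # 2. external APIs/systems the agent can call
    memory: dict[str, str]                      # 3. persistent knowledge and session history
    speech_enabled: bool = False                # 4. audio/speech interface toggle
    guardrails: list[Callable[[str], bool]] = field(default_factory=list)  # 5. safety checks
    orchestrator: str = "default"               # 6. deployment/monitoring layer identifier
```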

Common failure modes in Context Engineering 

Understanding failure modes helps engineers build more robust systems: 

  • Context Poisoning: Erroneous or hallucinated info contaminates the context and propagates. For example, an AI playing Pokémon hallucinating game states, then using that misinformation in future decisions. 
  • Context Distraction: Overloading with excessive history causes the model to ‘overfocus’ on past details and underperform on new reasoning. 
  • Context Confusion: Irrelevant or contradictory information pollutes the context, reducing reasoning effectiveness. Models may call irrelevant tools or misinterpret instructions. 
  • Context Clash: Conflicting information in the context leads to unstable outputs. Studies report up to 39% performance drops when incremental, conflicting data accumulate versus providing all info cleanly. 

Proper engineering anticipates and mitigates these pitfalls through careful context curation and system design. 

Techniques for effective Context Engineering 

To ensure the LLM’s working memory contains exactly the right information when it needs it, context engineering leverages four foundational techniques: 

  1. Writing Context: Creating persistent, external memory stores (“scratch pads”) for intermediate notes, long-term knowledge, and learned user preferences. 
  2. Selecting Context: Intelligently retrieving only relevant information for the current step, using embedding-based semantic search or Retrieval-Augmented Generation (RAG) to reduce noise and improve accuracy. 
  3. Compressing Context: Summarizing or hierarchically compressing history to retain essential information without exceeding token limits, like capturing meeting minutes rather than every word. 
  4. Isolating Context: Separating different tasks or information streams via multi-agent architectures or sandboxed threads to prevent interference and maintain clarity. 

Together, these allow AI applications to scale beyond brittle, monolithic prompts toward modular, maintainable systems. 
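
Two of these techniques, selecting and compressing, are easy to sketch. In the illustrative Python below, the embedding function is a random stand-in for a real embedding model, and the summary string is a placeholder where a production system would call an LLM:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g., a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # deterministic within a run
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def select_context(query: str, memory: list[str], k: int = 3) -> list[str]:
    """Selecting: keep only the k notes most relevant to the current step."""
    q = embed(query)
    return sorted(memory, key=lambda note: float(embed(note) @ q), reverse=True)[:k]

def compress_history(turns: list[str], keep_last: int = 4) -> list[str]:
    """Compressing: summarize old turns, keep recent ones verbatim."""
    if len(turns) <= keep_last:
        return turns
    summary = f"[summary of {len(turns) - keep_last} earlier turns]"  # an LLM call in practice
    return [summary] + turns[-keep_last:]
```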

Tools and frameworks powering Context Engineering 

Successful context engineering requires an integrated stack rather than standalone tools: 

  • Declarative AI Programming (e.g., DSPy): Define what you want, not how to ask, enabling high-level prompt compilation. 
  • Control Structures (e.g., LangGraph): Graph-based workflows with conditional logic and state persistence for complex context orchestration. 
  • Vector Databases (e.g., Pinecone, Weaviate, Chroma): Semantic search over massive documents, memories, and examples for precise context selection. 
  • Structured Generation Tools: Enforce consistent, parseable outputs (e.g., JSON, XML) to prevent corruption. 
  • Advanced Memory Systems (e.g., Mem0): Intelligent layers that manage information relationships, expire outdated facts, and filter irrelevant information. 
  • Orchestration Platforms: Glue everything together (models, databases, APIs) while maintaining context coherence and monitoring. 

These frameworks help move from ad hoc prompt hacks to systematic, scalable AI development. 
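
Of these layers, structured generation is the simplest to illustrate. Here is a minimal sketch of the idea, assuming plain JSON validation rather than any particular library:

```python
import json

# Instruction appended to the prompt so the model emits machine-parseable output.
SCHEMA_HINT = (
    'Respond ONLY with JSON of the form '
    '{"answer": string, "confidence": number in [0, 1], "sources": [string]}.'
)

def parse_structured(raw: str) -> dict:
    """Validate model output before it enters downstream context.
    Rejecting malformed output here prevents context poisoning later."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(data.get("answer"), str):
        raise ValueError("missing or non-string 'answer'")
    if not 0.0 <= float(data.get("confidence", -1.0)) <= 1.0:
        raise ValueError("'confidence' outside [0, 1]")
    return data
```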

(Source: “Context Engineering vs Prompt Engineering” by Mehul Gupta) 

Conclusion 

Context engineering marks a fundamental evolution in AI system design. It shifts the focus from crafting instructions to building dynamic information ecosystems that ensure large language models receive the precise data they need, when they need it and in the proper form, to perform reliably and robustly. 

For anyone building sophisticated AI applications or agents, mastering context engineering is no longer optional; it’s essential. By addressing the failure modes of prompt-only systems and leveraging emerging tools and architectures, context engineering unlocks the true potential of LLMs to become versatile, dependable partners in complex real-world tasks. 

Further Reading & Resources 

  • The New Skill in AI is Not Prompting, It’s Context Engineering 
  • 12-Factor Agents – Principles for building reliable LLM applications 
  • Context Engineering for Agents 
  • A Survey of Context Engineering for Large Language Models 

Novelis at Big Data & AI Paris 2025 

Join us on October 1–2, 2025, at Pavilion 3 – Paris Expo Porte de Versailles for the 14th edition of the leading trade show dedicated to Big Data and AI. The event brings together 220+ exhibitors, 300+ speakers, 17,000 attendees, and 7 open demo stages. 

Why this event matters 

  • It’s the merger of two major events – Big Data and AI Paris – offering a unified vision of data and intelligence. 
  • The Innovative Minds program (summits, case studies, technical expertise) highlights digital sovereignty, responsible AI, and open-source technologies. 
  • A rich format with Leaders Talks, Use Cases, and Expert Sessions for a deep dive into strategic and technical challenges. 

Novelis at the event 

As a member of Hub France AI, Novelis will have a stand in the Startup Village, a space designed to showcase innovative companies and foster connections with decision-makers, investors, and peers. 

On Wednesday, October 1st at 12:30 PM, Mehdi Nafe (CEO, Novelis) and Walid Dehhane (CTO, Novelis) will take the stage to pitch on: 
“How AI is revolutionizing interactions with Information Systems.” 

3 good reasons to meet us 

  • High-level pitch: strategy, business value, and technology explained clearly. 
  • Business networking: connect, exchange, and meet potential partners or clients. 
  • Market insights: capture the latest trends and real expectations from IT and data professionals. 

Don’t miss Big Data & AI Paris 2025, a unique opportunity to accelerate digital transformation. 
Meet Novelis at Stand ST23 – our leaders Walid and Mehdi will be there to share how AI is reshaping interactions with Information Systems. 

Get your pass now! 

Software 3.0: How Large Language Models Are Reshaping Programming and Applications 

Andrej Karpathy’s talk, “Software Is Changing (Again),” outlines how Large Language Models (LLMs) are revolutionizing how we build, interact with, and think about software. From the shift in programming paradigms to new opportunities in partial autonomy apps, Karpathy’s talk maps a path for developers, businesses, and technologists navigating this rapidly evolving landscape. 

In this article, we’ll break down the key ideas from Karpathy’s talk: how software has evolved into its third major phase, why LLMs are best understood as complex operating systems, the opportunities they unlock for application development, and what it means to build for agents in this new world. 

The Evolution of Software: From Traditional coding to Prompts 

Software can be categorized into three paradigms: 

Software 1.0: Traditional code written by humans (e.g., C++, Python, Java), where logic is explicitly programmed. 

Software 2.0: Neural networks, where logic emerges from training data rather than hand-coded rules. This shift allowed companies to replace explicit code with machine-learned components. 

Software 3.0: LLM-driven systems where prompts in natural language (English, French, Arabic, etc.) act as the code. Programming now means shaping the behavior of powerful language models with carefully crafted text inputs. 

Developers must become fluent in all three paradigms; each offers unique strengths and trade-offs. For example, for a sentiment classification task, here is how the three paradigms compare: 
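
The talk presents this comparison as a figure; the sketch below reconstructs the idea in illustrative Python, not Karpathy’s actual example.

```python
# Software 1.0: logic explicitly programmed by hand.
def sentiment_v1(text: str) -> str:
    positive = {"great", "love", "excellent"}
    negative = {"bad", "hate", "awful"}
    words = set(text.lower().split())
    return "positive" if len(words & positive) >= len(words & negative) else "negative"

# Software 2.0: logic lives in learned weights (hand-set here only for illustration).
def sentiment_v2(features: list[float], weights: list[float]) -> str:
    score = sum(f * w for f, w in zip(features, weights))
    return "positive" if score > 0 else "negative"

# Software 3.0: the natural-language prompt is the program.
SENTIMENT_PROMPT = (
    "Classify the sentiment of the following review as 'positive' or 'negative'.\n"
    "Review: {review}\nSentiment:"
)

print(sentiment_v1("I love this great product"))  # -> positive
```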

Large Language Models: The New Operating System 

LLMs are best viewed as operating systems (OS) for intelligence: 

Closed-source and open-source ecosystems resemble the early OS wars (Windows/macOS vs. Linux). Proprietary models like GPT and Gemini sit alongside open source ecosystems like LLaMA. 

LLMs as CPUs: The model is the compute engine, while the context window is akin to memory, shaping problem-solving within strict resource limits. 

1960s-style computing: LLM compute is expensive and centralized in the cloud, with users as thin clients. The future may eventually bring personal LLMs, but we’re not there yet. 

Interacting with an LLM today feels like using a terminal before the GUI era, powerful but raw. The “killer GUI” for LLMs has yet to be invented. 

LLM Psychology: Superhuman, Yet Flawed 

LLMs, he said, can be seen as stochastic simulations of people, capable of remarkable feats but prone to unique weaknesses: 

Superpowers: They possess encyclopedic knowledge and near-infinite memory of their training data. 

Cognitive deficits: LLMs hallucinate, lack persistent learning (anterograde amnesia), and sometimes make baffling errors (“jagged intelligence”). 

Security limitations: Their openness to manipulation makes them vulnerable to prompt injections and data leaks. 

The key to using LLMs effectively is building systems that leverage their strengths while mitigating their weaknesses, a human-in-the-loop approach. 

The Opportunity: Building Partial Autonomy Apps 

Direct interaction with LLMs will give way to dedicated applications that manage LLM behavior. For example, tools like Cursor (AI coding assistant) and Perplexity (LLM-powered search) orchestrate multiple models, manage context, and provide purpose-built GUIs. Apps should let users adjust the level of AI autonomy, from minor code suggestions to major repo changes. The most useful apps speed up the cycle of AI generation and human verification, using visual GUIs to audit AI output efficiently. 

Karpathy warns against overly ambitious full autonomy. Instead, developers should focus on incremental, auditable steps. 

Natural Language Programming & “Vibe Coding” 

In the Software 3.0 world, everyone becomes a programmer: 

Natural language as code: Since LLMs are programmed via prompts, anyone fluent in English can shape software behavior. 

Vibe coding: Karpathy’s term for casually building useful apps without deep technical expertise, and a gateway to more serious software development. 

However, he highlights the gap: while LLMs make generating code easy, deploying real apps (auth, payments, deployment) is still manual, tedious, and ripe for automation. 

Building for Agents: The Next Frontier 

To truly harness AI agents, we need to adapt our digital infrastructure: 

LLM-friendly web standards: Analogous to robots.txt, Karpathy proposes llms.txt files or markdown docs that speak directly to LLMs. 

Structured data for agents: Move beyond human-centric docs (“click here”) to machine-readable instructions (curl commands, APIs). 

Tools for LLM ingestion: Solutions like Gitingest and DeepWiki make large codebases consumable by LLMs, enabling smarter agent behavior. 

The future will involve both improving agent capabilities and redesigning the digital world to make it more agent-friendly. 

The Decade of Agents: What Comes Next 

Karpathy concludes with a pragmatic vision: 2025 won’t be the year of agents; the 2020s will be the decade of agents. 

Building partial autonomy systems with an “Iron Man suit” design, AI that augments humans while offering tunable autonomy, is the most promising path forward. Success will come not from chasing full autonomy today, but from carefully engineering human-AI cooperation at every step. 

Conclusion 

Software is changing, quickly and radically. With LLMs as the new programmable platform, the barriers to software creation are falling, but the complexity of verification, deployment, and safe autonomy is rising. Karpathy’s talk challenges us to build tools, infrastructure, and applications that respect this balance, putting human oversight at the heart of the AI revolution. 

Inside the “Agent-as-a-Judge” framework

As AI evolves from static models to agentic systems, evaluation becomes one of the most critical challenges in the field. Traditional methods focus on final outputs or rely heavily on expensive and slow human evaluations. Even automated approaches like LLM-as-a-Judge, while helpful, lack the ability to assess step-by-step reasoning or iterative planning, essential components of modern AI agents like AI code generators. To address this, researchers at Meta AI and KAUST introduce an innovative paradigm, Agent-as-a-Judge: a modular, agentic evaluator designed to assess agentic systems holistically, not just by what they produce, but by how they produce it.

Why traditional evaluation falls short

Modern AI agents operate through multi-step reasoning, interact with tools, adapt dynamically, and often work on long-horizon tasks. Evaluating them like static black-box models misses the forest for the trees. Indeed, final outputs don’t reflect whether the problem-solving trajectory was valid, human-in-the-loop evaluation is often costly, slow, and non-scalable, and LLM-based judging can’t fully capture contextual decision-making or modular reasoning.

Enter Agent-as-a-Judge

This new framework adds structured reasoning to evaluation by leveraging agentic capabilities to analyze agent behavior. It does so through specialized modules:

  • Ask: pose questions about unclear or missing parts of the requirements.
  • Read: analyze the agent’s outputs and intermediate files.
  • Locate: identify the relevant code or documentation sections.
  • Retrieve: gather context from related sources.
  • Graph: understand logical and structural relationships in the task.

We can think of it as a code reviewer with reasoning skills, evaluating not just what was built, but also how it was built.
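
As a rough, hypothetical sketch of that idea (method bodies are placeholders, not the paper’s implementation):

```python
from dataclasses import dataclass

@dataclass
class AgentJudge:
    workspace: dict[str, str]  # filename -> artifact produced by the judged agent

    def ask(self, requirement: str) -> str:
        """Ask: pose a question about an unclear or missing requirement."""
        return f"What evidence shows that '{requirement}' is satisfied?"

    def read(self, filename: str) -> str:
        """Read: inspect an output or intermediate file."""
        return self.workspace.get(filename, "")

    def locate(self, requirement: str) -> list[str]:
        """Locate: find the files relevant to the requirement."""
        return [f for f, text in self.workspace.items() if requirement.lower() in text.lower()]

    def retrieve(self, query: str) -> list[str]:
        """Retrieve: gather related context snippets."""
        return [t for t in self.workspace.values() if query.lower() in t.lower()]

    def judge(self, requirement: str) -> bool:
        """Combine modules: pass if some artifact evidences the requirement.
        (The Graph module, which models task structure, is omitted here.)"""
        return bool(self.locate(requirement) or self.retrieve(requirement))
```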

DevAI: a benchmark that matches real-world complexity

To test this, the team created DevAI, a benchmark with 55 real-world AI development tasks and 365 evaluation criteria, from low-level implementation details to high-level functionality. Unlike existing benchmarks, these tasks represent the kind of messy, multi-layered goals AI agents face in production environments.

The results: Agent-as-a-Judge vs. Human-as-a-Judge & LLM-as-a-Judge

The study evaluated three popular AI agents (MetaGPT, GPT-Pilot, and OpenHands) using human experts, LLM-as-a-Judge, and the new Agent-as-a-Judge framework. While human evaluation was the gold standard, it was slow and costly. LLM-as-a-Judge offered moderate accuracy (around 70%) with reduced cost and time. Agent-as-a-Judge, however, achieved over 95% alignment with human judgments while being 97.64% cheaper and 97.72% faster.

Implications

The most exciting part? This could create a self-improving loop: agents that evaluate other agents to generate better data to train stronger agents. This “agentic flywheel” is an interesting and exciting vision for future AI systems that can critique, debug, and improve each other autonomously. Agent-as-a-Judge isn’t just a better way to score AI. It might very well be a paradigm shift in how we interpret agent behavior, detect failure modes, provide meaningful feedback, and create accountable and trustworthy systems.

MCP: The Protocol Connecting AI Models to Your Applications and Tools

Artificial intelligence models continue to grow in power, but their effectiveness is often limited by one key factor: access to the right data at the right time. Each new source of information still requires specific, time-consuming, and fragile integration, thus limiting the real impact of LLMs.

To address this issue, Anthropic – the creator of the Claude model – introduced the Model Context Protocol (MCP), a universal protocol designed to standardize and secure connections between AI models and external data sources or tools. MCP aims to simplify and streamline bidirectional exchanges between AI assistants and work environments, whether local or remote.

A Simple Architecture Designed for Efficiency

The protocol is based on a streamlined yet powerful structure. It defines communication between AI systems, such as Claude or any other conversational agent, and MCP servers, which provide access to resources like files, APIs, or databases. These servers expose specific capabilities, and the AI systems dynamically connect to them to interact with the data as needed.

Specifically, MCP provides a detailed technical specification, SDKs to facilitate development and accelerate adoption, and an open-source repository containing preconfigured MCP servers that are ready to use. This approach aims to make the protocol accessible to developers while ensuring robust integration.
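
For a feel of what this looks like in practice, here is a minimal sketch of an MCP server exposing one tool, assuming the FastMCP helper from the official Python SDK (`pip install mcp`); the tool itself and its data are invented for illustration.

```python
from mcp.server.fastmcp import FastMCP

# Declare a server; AI systems such as Claude Desktop can connect to it dynamically.
mcp = FastMCP("sales-reports")

@mcp.tool()
def get_report(quarter: str) -> str:
    """Return the sales report for a given quarter (stub data for this sketch)."""
    reports = {"Q1": "Revenue up 4% quarter over quarter.", "Q2": "Revenue flat."}
    return reports.get(quarter, "No report found for that quarter.")

if __name__ == "__main__":
    mcp.run()  # serves the tool over MCP's standard transport (stdio by default)
```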

Rapid Adoption by Major Players

The technology is already attracting industry leaders. Claude Desktop, developed by Anthropic, natively integrates MCP. Google has also announced support for the protocol for its Gemini models, while OpenAI plans to integrate it into ChatGPT soon, both on desktop and mobile versions. This rapid adoption highlights MCP’s potential to become a standard for connected AI.

A New Standard for Connected AI

By establishing a common, persistent interface, MCP goes beyond the limitations of traditional APIs. While these typically operate through one-off, often disconnected calls, MCP allows AI agents to maintain a session context, track its evolution, and interact in a smoother, more coherent, and intelligent manner.

This ability to maintain a shared state between the model and tools enhances the user experience. Agents can anticipate needs, personalize responses, and learn from interaction history to adapt more effectively.

A Strategic Solution for Businesses

Beyond the technical innovation, MCP represents a strategic lever for organizations. It significantly reduces the costs associated with integrating new data sources while accelerating the implementation of concrete AI-based use cases. By facilitating the creation of interoperable ecosystems, MCP offers businesses greater agility in responding to the rapidly evolving needs of the market.

Exploring MiniMax-01: Pushing the boundaries of context lengths and model efficiency in LLMs

For LLMs (Large Language Models), the ability to handle large contexts is essential. MiniMax-01, a new series of models developed by MiniMax, presents significant improvements in both model scalability and computational efficiency, achieving context windows of up to 4 million tokens—20-32 times longer than most current LLMs. 

Key innovations in MiniMax-01: 

  1. Record-breaking context lengths: MiniMax-01 surpasses models like GPT-4 and Claude-3.5-Sonnet, allowing context lengths of up to 4 million tokens. This enables the model to process entire documents, reports, or multi-chapter books in a single inference step, without the need to chunk documents. 
  2. Lightning Attention and Mixture of Experts: 
      • Lightning Attention: a linear-complexity attention mechanism designed for efficient sequence processing (the generic idea is sketched after this list). 
      • Mixture of Experts: a framework with 456 billion parameters distributed across 32 experts. Only 45.9 billion parameters are activated per token, ensuring minimal computational overhead while maintaining high performance. 
  3. Efficient Training and Inference: MiniMax-01 uses several parallelism strategies to optimize GPU usage and reduce communication overhead: 
      • Expert Parallel and Tensor Parallel techniques to optimize training efficiency. 
      • Multi-level Padding and Sequence Parallelism to increase GPU utilization to 75%. 
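
Lightning Attention itself is a specific mechanism, but the generic linear-attention idea behind such designs can be sketched in a few lines. This is an illustrative kernel-trick version, not MiniMax’s actual implementation: by associativity, φ(Q)(φ(K)ᵀV) costs O(n·d²) in sequence length n instead of the O(n²·d) of standard softmax attention.

```python
import numpy as np

def linear_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Generic linear attention (illustrative, not Lightning Attention itself).
    Associativity lets us form phi(K).T @ V once, avoiding the n x n score matrix."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # simple positive feature map
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                    # (d, d_v): key/value summary, O(n*d*d_v)
    z = Qp @ Kp.sum(axis=0)          # (n,): per-query normalizer
    return (Qp @ kv) / z[:, None]    # (n, d_v), never materializes n x n scores

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)      # cost grows linearly with n
```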

MiniMax-VL-01: Also a Vision-Language Model 

In addition to MiniMax-Text-01, MiniMax has extended the same innovations to multimodal tasks with MiniMax-VL-01. Trained on 512 billion vision-language tokens, this model can efficiently process both text and visual data, making it suitable for tasks like image captioning, image-based reasoning, and multimodal understanding. 

Practical Applications: 

The ability to handle 4 million tokens unlocks potential across various sectors: 

  • Legal and Financial Analysis: Process complete legal cases or financial reports in a single pass. 
  • Scientific Research: Analyze large research datasets or summarize years of studies. 
  • Creative Writing: Generate long-form narratives with complex story arcs. 
  • Multimodal Applications: Enhance tasks requiring both text and image integration. 

MiniMax has made MiniMax-01 publicly available through Hugging Face. 

🔗 Explore MiniMax-01 on Hugging Face 

Debunking AI Criticism: Between Reality and Opportunity

“The emergence of generative artificial intelligence has marked a major turning point in the technological landscape, eliciting both fascination and hope. Within a few years, it has unveiled extraordinary potential, promising to transform entire sectors, from automating creative tasks to solving complex problems. This rise has placed AI at the center of technological, economic, and ethical debates.

However, generative AI has not escaped criticism. Some question the high costs of implementing and training large models, highlighting the massive infrastructure and energy resources required. Others point to the issue of hallucinations, instances where models produce erroneous or incoherent information, potentially impacting the reliability of services offered. Additionally, some liken it to a technological “bubble,” drawing parallels to past speculation around cryptocurrencies or the metaverse, suggesting the current enthusiasm for AI may be short-lived and overhyped.

These questions are legitimate and fuel an essential debate about the future of AI. However, limiting ourselves to these criticisms overlooks the profound transformations and tangible impacts that artificial intelligence is already fostering across many sectors. In this article, we will delve deeply into these issues to demonstrate that, despite the challenges raised, generative AI is far more than a fleeting trend. Its revolutionary potential is just beginning to materialize, and addressing these criticisms will shed light on why it is poised to become an essential driver of progress.”

Please fill out the form to download the document and learn more.


El Hassane Ettifour, our Director of Research and Innovation, dives into this topic and shares his insights in this exclusive video.

Large Language Models versus Wall Street: Can AI enhance your financial investment decisions?

How do you determine which stocks to buy, sell, or hold? This is a complex question that requires considering multiple factors: geopolitical events, market trends, company-specific news, and macroeconomic conditions. For individuals or small to medium businesses, taking all these factors into account can be overwhelming. Even large corporations with dedicated financial analysts face challenges due to organizational silos or lack of communication.

Inspired by the success of GPT-4’s reasoning abilities, researchers from Alpha Tensor Technologies Ltd., the University of Piraeus, and Innov-Acts have developed MarketSenseAI, a GPT-4-based framework designed to assist with stock-related decisions—whether to buy, sell, or hold. MarketSenseAI provides not only predictive capabilities and a signal evaluation mechanism but also explains the rationale behind its recommendations.

The platform is highly customizable to suit an individual’s or company’s risk tolerance, investment plans, and other preferences. It consists of five core modules:

  1. Progressive News Summary – Summarizes recent developments in the company or sector, alongside past news reports.
  2. Fundamentals Summary – Analyzes the company’s latest financial statements, providing quantifiable metrics.
  3. Macroeconomic Summary – Examines the macroeconomic factors influencing the current market environment.
  4. Stock Price Dynamics – Analyzes the stock’s price movements and trends.
  5. Signal Generation – Integrates the information from all the modules to deliver a comprehensive investment recommendation for a specific stock, along with a detailed rationale.
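
To see how the modules fit together, here is a hypothetical orchestration sketch in Python; the function names and prompts are illustrative, not MarketSenseAI’s actual API.

```python
from typing import Callable

def market_sense(stock: str, llm: Callable[[str], str]) -> dict:
    """Hypothetical pipeline mirroring the five modules described above."""
    news = llm(f"Summarize recent developments and past news for {stock}.")       # 1
    fundamentals = llm(f"Summarize the latest financial statements of {stock}.")  # 2
    macro = llm("Summarize macroeconomic factors shaping the current market.")    # 3
    price = llm(f"Describe the price movements and trends of {stock} stock.")     # 4
    signal = llm(                                                                 # 5
        "Based on the analyses below, recommend buy, sell, or hold "
        "and explain the rationale.\n"
        f"NEWS: {news}\nFUNDAMENTALS: {fundamentals}\nMACRO: {macro}\nPRICE: {price}"
    )
    return {"stock": stock, "recommendation": signal}
```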

This framework serves as a valuable assistant in the decision-making process, empowering investors to make more informed choices. Integrating AI into investment decisions offers several key advantages: it introduces less bias compared to human analysts, efficiently processes large volumes of unstructured data, and identifies patterns, outliers, and discrepancies that traditional analysis might overlook.

Ramsay Santé Optimizes Operations with Novelis

Customer Order Automation: A Successful Project to Transform Processes