Inside the “Agent-as-a-Judge” framework

As AI evolves from static models to agentic systems, evaluation becomes one of the most critical challenges in the field. Traditional methods focus on final outputs or rely heavily on expensive and slow human evaluations. Even automated approaches like LLM-as-a-Judge, while helpful, cannot assess the step-by-step reasoning and iterative planning that are essential to modern AI agents such as AI code generators. To address this, researchers at Meta AI and KAUST introduce a new paradigm, Agent-as-a-Judge: a modular, agentic evaluator designed to assess agentic systems holistically, not just by what they produce but by how they produce it.

Why traditional evaluation falls short

Modern AI agents operate through multi-step reasoning, interact with tools, adapt dynamically, and often work on long-horizon tasks. Evaluating them like static black-box models misses the forest for the trees: final outputs don’t reveal whether the problem-solving trajectory was valid; human-in-the-loop evaluation is costly, slow, and hard to scale; and LLM-based judging can’t fully capture contextual decision-making or modular reasoning.

Enter Agent-as-a-Judge

This new framework adds structured reasoning to evaluation by leveraging agentic capabilities to analyze agent behavior. It does so through specialized modules:

  • Ask: pose questions about unclear or missing parts of the requirements.
  • Read: analyze the agent’s outputs and intermediate files.
  • Locate: identify the relevant code or documentation sections.
  • Retrieve: gather context from related sources.
  • Graph: understand logical and structural relationships in the task.

We can think of it as a code reviewer with reasoning skills, evaluating not just what was built, but also how it was built.
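To make the decomposition concrete, here is a minimal, hypothetical sketch of how such a modular evaluator could be wired together in Python. The module names follow the paper, but every implementation detail below, including the `llm` helper, is a placeholder assumption rather than the authors’ actual code.

```python
# Hypothetical sketch of an Agent-as-a-Judge evaluation loop. Module names
# (ask/read/locate/retrieve/graph) follow the paper; every implementation
# detail below, including llm(), is a placeholder, not the authors' code.
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Stand-in for a call to an underlying language model."""
    return f"[model response to: {prompt[:50]}...]"

@dataclass
class Workspace:
    requirement: str                                 # one DevAI-style criterion
    files: dict = field(default_factory=dict)        # path -> file contents
    trajectory: list = field(default_factory=list)   # agent's intermediate steps

def ask(ws):            # clarify unclear or missing requirements
    return llm(f"What is unclear or missing in: {ws.requirement}?")

def locate(ws):         # find code/doc sections relevant to the requirement
    words = ws.requirement.lower().split()
    return [p for p in ws.files if any(w in p.lower() for w in words)]

def read(ws, paths):    # inspect outputs and intermediate files
    return {p: ws.files[p] for p in paths}

def retrieve(ws):       # gather related context from the agent's trajectory
    return [s for s in ws.trajectory if ws.requirement.lower() in s.lower()]

def graph(ws):          # reason about structural relationships in the task
    return llm(f"Describe dependencies among files: {sorted(ws.files)}")

def judge(ws: Workspace) -> str:
    """Combine module outputs into a satisfied / not-satisfied verdict."""
    evidence = {
        "clarification": ask(ws),
        "relevant_files": read(ws, locate(ws)),
        "trajectory": retrieve(ws),
        "structure": graph(ws),
    }
    return llm(f"Given this evidence, is the requirement "
               f"'{ws.requirement}' satisfied? Evidence: {evidence}")
```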

DevAI: a benchmark that matches real-world complexity

To test this, the team created DevAI, a benchmark with 55 real-world AI development tasks and 365 evaluation criteria, ranging from low-level implementation details to high-level functionality. Unlike existing benchmarks, these tasks represent the kind of messy, multi-layered goals AI agents face in production environments.

The results: Agent-as-a-Judge vs. Human-as-a-Judge & LLM-as-a-Judge

The study evaluated three popular AI agents (MetaGPT, GPT-Pilot, and OpenHands) using human experts, LLM-as-a-Judge, and the new Agent-as-a-Judge framework. While human evaluation was the gold standard, it was slow and costly. LLM-as-a-Judge offered moderate accuracy (around 70%) with reduced cost and time. Agent-as-a-Judge, however, achieved over 95% alignment with human judgments while being 97.64% cheaper and 97.72% faster.

Implications

The most exciting part? This could create a self-improving loop: agents that evaluate other agents, generating better data to train stronger agents. This “agentic flywheel” is an exciting vision for future AI systems that can critique, debug, and improve each other autonomously. Agent-as-a-Judge isn’t just a better way to score AI. It may well be a paradigm shift in how we interpret agent behavior, detect failure modes, provide meaningful feedback, and build accountable, trustworthy systems.

Further reading:

MCP: The Protocol Connecting AI Models to Your Applications and Tools

Artificial intelligence models continue to grow in power, but their effectiveness is often limited by one key factor: access to the right data at the right time. Each new source of information still requires a bespoke, time-consuming, and fragile integration, limiting the real impact of LLMs.

To address this issue, Anthropic – the creator of the Claude model – introduced the Model Context Protocol (MCP), a universal protocol designed to standardize and secure connections between AI models and external data sources or tools. MCP aims to simplify and streamline bidirectional exchanges between AI assistants and work environments, whether local or remote.

A Simple Architecture Designed for Efficiency

The protocol is based on a streamlined yet powerful structure. It defines communication between AI systems, such as Claude or any other conversational agent, and MCP servers, which provide access to resources like files, APIs, or databases. These servers expose specific capabilities, and the AI systems dynamically connect to them to interact with the data as needed.

Specifically, MCP provides a detailed technical specification, SDKs to facilitate development and accelerate adoption, and an open-source repository containing preconfigured MCP servers that are ready to use. This approach aims to make the protocol accessible to developers while ensuring robust integration.
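As an illustration of that developer experience, here is a minimal server sketch using the official Python SDK’s `FastMCP` helper (`pip install mcp`). The `count_lines` tool and `config://app-version` resource are invented examples, and the SDK surface may evolve:

```python
# Minimal MCP server sketch using the official Python SDK.
# The tool and resource below are invented examples for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def count_lines(path: str) -> int:
    """Count the lines of a local text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

@mcp.resource("config://app-version")
def app_version() -> str:
    """Expose a read-only piece of application metadata."""
    return "1.0.0"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; clients connect dynamically
```

Running this script starts a server over stdio; an MCP-capable client (such as Claude Desktop) can then discover and call its capabilities dynamically.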

Rapid Adoption by Major Players

The technology is already attracting industry leaders. Claude Desktop, developed by Anthropic, natively integrates MCP. Google has also announced support for the protocol for its Gemini models, while OpenAI plans to integrate it into ChatGPT soon, on both desktop and mobile. This rapid adoption highlights MCP’s potential to become a standard for connected AI.

A New Standard for Connected AI

By establishing a common, persistent interface, MCP goes beyond the limitations of traditional APIs. While these typically operate through one-off, often disconnected calls, MCP allows AI agents to maintain a session context, track its evolution, and interact in a smoother, more coherent, and intelligent manner.
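On the client side, this persistence is explicit. In the sketch below (assuming the official Python SDK and the hypothetical `server.py` from the earlier example), the handshake, capability discovery, and tool calls all happen inside one long-lived session rather than as disconnected requests:

```python
# Sketch of an MCP client keeping one persistent session open (assumes the
# official Python SDK and the hypothetical server.py from the earlier sketch).
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()           # handshake starts the session
            tools = await session.list_tools()   # discover capabilities dynamically
            print([t.name for t in tools.tools])
            result = await session.call_tool("count_lines",
                                             arguments={"path": "notes.txt"})
            print(result)                        # same session, shared context

asyncio.run(main())
```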

This ability to maintain a shared state between the model and tools enhances the user experience. Agents can anticipate needs, personalize responses, and learn from interaction history to adapt more effectively.

A Strategic Solution for Businesses

Beyond the technical innovation, MCP represents a strategic lever for organizations. It significantly reduces the costs associated with integrating new data sources while accelerating the implementation of concrete AI-based use cases. By facilitating the creation of interoperable ecosystems, MCP offers businesses greater agility in responding to the rapidly evolving needs of the market.

Exploring MiniMax-01: Pushing the boundaries of context lengths and model efficiency in LLMs

For Large Language Models (LLMs), the ability to handle long contexts is essential. MiniMax-01, a new series of models developed by MiniMax, presents significant improvements in both model scalability and computational efficiency, achieving context windows of up to 4 million tokens, 20 to 32 times longer than most current LLMs.

Key innovations in MiniMax-01: 

  1. Record-breaking context lengths: MiniMax-01 surpasses models like GPT-4 and Claude-3.5-Sonnet by allowing context lengths of up to 4 million tokens. This enables the model to process entire documents, reports, or multi-chapter books in a single inference step, without the need to chunk documents.
  2. Lightning Attention and Mixture of Experts:
     • Lightning Attention: a linear-complexity attention mechanism designed for efficient sequence processing (see the sketch after this list).
     • Mixture of Experts: a framework with 456 billion parameters distributed across 32 experts. Only 45.9 billion parameters are activated per token, ensuring minimal computational overhead while maintaining high performance.
  3. Efficient Training and Inference: MiniMax-01 uses several parallelism strategies to optimize GPU usage and reduce communication overhead:
     • Expert Parallel and Tensor Parallel techniques to optimize training efficiency.
     • Multi-level Padding and Sequence Parallelism to increase GPU utilization to 75%.
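MiniMax’s actual Lightning Attention kernel is more involved, but the core linear-complexity idea can be illustrated with a toy kernelized-attention sketch in the style of Katharopoulos et al. (2020): replacing the softmax over all query-key pairs with positive feature maps lets a key-value summary be computed once, so cost grows linearly with sequence length.

```python
# Toy sketch of kernelized linear attention. It illustrates the
# linear-complexity idea only; it is NOT MiniMax's Lightning Attention kernel.
import numpy as np

def feature_map(x):
    """ELU(x) + 1: a simple positive feature map used in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Attention in O(n * d * d_v) time instead of O(n^2 * d).

    Q, K: (n, d) queries and keys; V: (n, d_v) values.
    """
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                       # (d, d_v): one-pass key-value summary
    norm = Qf @ Kf.sum(axis=0)          # (n,): per-query normalizer
    return (Qf @ kv) / (norm[:, None] + 1e-6)

rng = np.random.default_rng(0)
n, d = 4096, 64
out = linear_attention(rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)))
print(out.shape)  # (4096, 64)
```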

MiniMax-VL-01: Also a Vision-Language Model 

In addition to MiniMax-Text-01, MiniMax has extended the same innovations to multimodal tasks with MiniMax-VL-01. Trained on 512 billion vision-language tokens, this model can efficiently process both text and visual data, making it suitable for tasks like image captioning, image-based reasoning, and multimodal understanding.

Practical Applications: 

The ability to handle 4 million tokens unlocks potential across various sectors:

  • Legal and Financial Analysis: Process complete legal cases or financial reports in a single pass. 
  • Scientific Research: Analyze large research datasets or summarize years of studies. 
  • Creative Writing: Generate long-form narratives with complex story arcs. 
  • Multimodal Applications: Enhance tasks requiring both text and image integration. 

MiniMax has made MiniMax-01 publicly available through Hugging Face.

🔗 Explore MiniMax-01 on Hugging Face 

Debunking AI Criticism: Between Reality and Opportunity

“The emergence of generative artificial intelligence has marked a major turning point in the technological landscape, eliciting both fascination and hope. Within a few years, it has unveiled extraordinary potential, promising to transform entire sectors, from automating creative tasks to solving complex problems. This rise has placed AI at the center of technological, economic, and ethical debates.

However, generative AI has not escaped criticism. Some question the high costs of implementing and training large models, highlighting the massive infrastructure and energy resources required. Others point to the issue of hallucinations, instances where models produce erroneous or incoherent information, potentially impacting the reliability of services offered. Additionally, some liken it to a technological “bubble,” drawing parallels to past speculation around cryptocurrencies or the metaverse, suggesting the current enthusiasm for AI may be short-lived and overhyped.

These questions are legitimate and fuel an essential debate about the future of AI. However, limiting ourselves to these criticisms overlooks the profound transformations and tangible impacts that artificial intelligence is already fostering across many sectors. In this article, we will delve deeply into these issues to demonstrate that, despite the challenges raised, generative AI is far more than a fleeting trend. Its revolutionary potential is just beginning to materialize, and addressing these criticisms will shed light on why it is poised to become an essential driver of progress.”



El Hassane Ettifouri, our Director of Research and Innovation, dives into this topic and shares his insights in this exclusive video.

Large Language Models versus Wall Street: Can AI enhance your financial investment decisions?

How do you determine which stocks to buy, sell, or hold? This is a complex question that requires considering multiple factors: geopolitical events, market trends, company-specific news, and macroeconomic conditions. For individuals or small to medium businesses, taking all these factors into account can be overwhelming. Even large corporations with dedicated financial analysts face challenges due to organizational silos or lack of communication.

Inspired by the success of GPT-4’s reasoning abilities, researchers from Alpha Tensor Technologies Ltd., the University of Piraeus, and Innov-Acts have developed MarketSenseAI, a GPT-4-based framework designed to assist with stock-related decisions—whether to buy, sell, or hold. MarketSenseAI provides not only predictive capabilities and a signal evaluation mechanism but also explains the rationale behind its recommendations.

The platform is highly customizable to suit an individual’s or company’s risk tolerance, investment plans, and other preferences. It consists of five core modules (a minimal sketch follows the list):

  1. Progressive News Summary – Summarizes recent developments in the company or sector, alongside past news reports.
  2. Fundamentals Summary – Analyzes the company’s latest financial statements, providing quantifiable metrics.
  3. Macroeconomic Summary – Examines the macroeconomic factors influencing the current market environment.
  4. Stock Price Dynamics – Analyzes the stock’s price movements and trends.
  5. Signal Generation – Integrates the information from all the modules to deliver a comprehensive investment recommendation for a specific stock, along with a detailed rationale.
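Here is a minimal, hypothetical sketch of how these five modules might compose. The real MarketSenseAI prompts and interfaces are not public here, so every function name and the `llm` stand-in are assumptions:

```python
# Hypothetical sketch of a MarketSenseAI-style pipeline. The five module
# names follow the article; prompts, types, and llm() are assumptions.
from dataclasses import dataclass

def llm(prompt: str) -> str:
    """Stand-in for a GPT-4 call; swap in a real client in practice."""
    return f"[model response to: {prompt[:50]}...]"

@dataclass
class StockContext:
    ticker: str
    news: str        # recent and past company/sector news
    financials: str  # latest financial statements
    macro: str       # macroeconomic indicators and commentary
    prices: str      # historical price series, e.g. as CSV text

def progressive_news_summary(ctx):  # module 1
    return llm(f"Summarize developments for {ctx.ticker}, old and new: {ctx.news}")

def fundamentals_summary(ctx):      # module 2
    return llm(f"Extract quantifiable metrics from: {ctx.financials}")

def macroeconomic_summary(ctx):     # module 3
    return llm(f"Assess the macro environment: {ctx.macro}")

def stock_price_dynamics(ctx):      # module 4
    return llm(f"Describe price movements and trends: {ctx.prices}")

def signal_generation(ctx) -> str:  # module 5: integrate everything
    evidence = "\n".join([
        progressive_news_summary(ctx),
        fundamentals_summary(ctx),
        macroeconomic_summary(ctx),
        stock_price_dynamics(ctx),
    ])
    return llm(f"Given the evidence below, recommend buy/sell/hold for "
               f"{ctx.ticker} with a detailed rationale:\n{evidence}")

print(signal_generation(StockContext("ACME", "…", "…", "…", "…")))
```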

This framework serves as a valuable assistant in the decision-making process, empowering investors to make more informed choices. Integrating AI into investment decisions offers several key advantages: it introduces less bias compared to human analysts, efficiently processes large volumes of unstructured data, and identifies patterns, outliers, and discrepancies that traditional analysis might overlook.

Ramsay Santé Optimizes Operations with Novelis

Customer Order Automation: A Successful Project to Transform Processes 

Novelis attends the Artificial Intelligence Expo of the Ministry of the Interior

On October 8th, 2024, Novelis will participate in the Artificial Intelligence Expo of the Digital Transformation Directorate of the Ministry of the Interior.

This event, held at the Bercy Lumière Building in Paris, will immerse you in the world of AI through demonstrations, interactive booths, and immersive workshops. It’s the perfect opportunity to explore the latest technological advancements that are transforming our organizations!

Join Novelis: Turning Generative AI into a Strength for Information Sharing

We invite you to discover how Novelis is revolutionizing the way businesses leverage their expertise and share knowledge through Generative AI. At our booth, we will highlight the challenges and solutions for the reliable and efficient transmission of information within organizations.

Our experts – El Hassane Ettifouri, Director of Innovation; Sanoussi Alassan, Ph.D. in AI and Generative AI Specialist; and Laura Minkova, Data Scientist – will be present to share their insights on how AI can transform your organization.

Don’t miss this opportunity to connect with us and enhance your company’s efficiency!

[Webinar] Take the Guesswork Out of Your Intelligent Automation Initiatives with Process Intelligence 

Are you struggling to determine how to kick-start or optimize your intelligent automation efforts? You’re not alone. Many organizations face challenges in deploying automation and AI technologies effectively, often wasting time and resources. The good news is there’s a way to take the guesswork out of the process: Process Intelligence.

Join us on September 26 for an exclusive webinar with our partner ABBYY, Take the Guesswork Out of Your Intelligent Automation Initiatives Using Process Intelligence. In this session, Catherine Stewart, President of the Americas at Novelis, will share her expertise on how businesses can use process mining and task mining to optimize workflows and deliver real, measurable impact.  

Why You Should Attend 

Automation has the potential to transform your business operations, but without the right approach, efforts can easily fall flat. Catherine Stewart will draw from her extensive experience leading automation initiatives to reveal how process intelligence can help businesses achieve efficiency gains, reduce bottlenecks, and ensure long-term success. 

Key highlights: 

  • How process intelligence can provide critical insights into how your processes are performing and where inefficiencies lie. 
  • The role of task mining in capturing task-level data to complement process mining, providing a complete view of your operations. 
  • Real-world examples of how Novelis has helped clients optimize their automation efforts using process intelligence, leading to improved efficiency, accuracy, and customer satisfaction. 
  • The importance of digital twins for simulating business processes, allowing for continuous improvements without affecting production systems. 

Graphical user interface agents optimization for visual instruction grounding using multi-modal Artificial Intelligence systems

Discover the first version of our scientific publication “Graphical user interface agents optimization for visual instruction grounding using multi-modal artificial intelligence systems”, published on arXiv and submitted to the journal Engineering Applications of Artificial Intelligence. The article is already publicly available.

Thanks to the Novelis research team for their know-how and expertise.

Abstract

Most instance perception and image understanding solutions focus mainly on natural images. However, applications for synthetic images, and more specifically, images of Graphical User Interfaces (GUI) remain limited. This hinders the development of autonomous computer-vision-powered Artificial Intelligence (AI) agents. In this work, we present Search Instruction Coordinates or SIC, a multi-modal solution for object identification in a GUI. More precisely, given a natural language instruction and a screenshot of a GUI, SIC locates the coordinates of the component on the screen where the instruction would be executed. To this end, we develop two methods. The first method is a three-part architecture that relies on a combination of a Large Language Model (LLM) and an object detection model. The second approach uses a multi-modal foundation model.
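As a loose illustration of the first method’s idea (an object detector proposing GUI components and an LLM grounding the instruction to one of them), here is a hypothetical sketch; none of these names or interfaces come from the paper:

```python
# Loose, hypothetical sketch of the first SIC method described in the
# abstract: an object detector proposes GUI components, and an LLM grounds
# the natural language instruction to one. All names here are assumptions.

def detect_components(screenshot):
    """Detection stage stub: return candidate GUI components with boxes."""
    return [{"label": "field:Email", "box": (120, 200, 400, 230)},
            {"label": "button:Submit", "box": (420, 310, 510, 340)}]

def ground_instruction(instruction, components, llm):
    """LLM stage: choose which detected component the instruction targets."""
    listing = "\n".join(f"{i}: {c['label']}" for i, c in enumerate(components))
    idx = int(llm(f"Instruction: {instruction}\nComponents:\n{listing}\n"
                  f"Reply with the index of the target component only."))
    return components[idx]["box"]

def click_point(box):
    """Return the center of the chosen component's bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)

# Demo with a trivial stand-in LLM that always picks component 1.
box = ground_instruction("Click the submit button",
                         detect_components(screenshot=None),
                         llm=lambda prompt: "1")
print(click_point(box))  # (465, 325)
```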
