Software 3.0: How Large Language Models Are Reshaping Programming and Applications 

Andrej Karpathy’s talk, “Software Is Changing (Again),” outlines how Large Language Models (LLMs) are revolutionizing how we build, interact with, and think about software. From the shift in programming paradigms to new opportunities in partial autonomy apps, Karpathy’s talk maps a path for developers, businesses, and technologists navigating this rapidly evolving landscape. 

In this article, we’ll break down the key ideas from Karpathy’s talk: how software has evolved into its third major phase, why LLMs are best understood as complex operating systems, the opportunities they unlock for application development, and what it means to build for agents in this new world. 

The Evolution of Software: From Traditional Code to Prompts

Software can be categorized into three paradigms:

Software 1.0: Traditional code written by humans (e.g., C++, Python, Java), where logic is explicitly programmed. 

Software 2.0: Neural networks, where logic emerges from training data rather than hand-coded rules. This shift allowed companies to replace explicit code with machine-learned components. 

Software 3.0: LLM-driven systems where prompts in natural language (English, French, Arabic, etc.) act as the code. Programming now means shaping the behavior of powerful language models with carefully crafted text inputs. 

Developers must become fluent in all three paradigms; each offers unique strengths and trade-offs. For example, for a sentiment classification task, here is how the three paradigms compare:
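As a rough, illustrative sketch (not from the talk), here is what the same sentiment classifier might look like in each paradigm; the `llm` callable in the 3.0 version stands in for any completion API:

```python
# Software 1.0: the logic is written explicitly by a human.
def classify_1_0(review: str) -> str:
    positive = {"great", "love", "excellent", "good"}
    negative = {"awful", "hate", "terrible", "bad"}
    words = set(review.lower().split())
    return "positive" if len(words & positive) >= len(words & negative) else "negative"

# Software 2.0: the logic is learned from labeled data.
# Requires the `transformers` package; downloads a default fine-tuned checkpoint.
from transformers import pipeline
classify_2_0 = pipeline("sentiment-analysis")

# Software 3.0: the prompt is the program.
def classify_3_0(review: str, llm) -> str:
    return llm(f"Classify the sentiment of this review as positive or negative:\n{review}")
```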

Large Language Models: The New Operating System 

LLMs are best viewed as operating systems for intelligence:

Closed-source and open-source ecosystems resemble the early OS wars (Windows/macOS vs. Linux). Proprietary models like GPT and Gemini sit alongside open-source ecosystems like LLaMA.

LLMs as CPUs: The model is the compute engine, while the context window is akin to memory, shaping problem-solving within strict resource limits. 

1960s-style computing: LLM compute is expensive and centralized in the cloud, with users as thin clients. The future may eventually bring personal LLMs, but we’re not there yet. 

Interacting with an LLM today feels like using a terminal before the GUI era: powerful but raw. The “killer GUI” for LLMs has yet to be invented.

LLM Psychology: Superhuman, Yet Flawed 

LLMs, Karpathy says, can be seen as stochastic simulations of people, capable of remarkable feats but prone to unique weaknesses:

Superpowers: They possess encyclopedic knowledge and near-infinite memory of their training data. 

Cognitive deficits: LLMs hallucinate, lack persistent learning (anterograde amnesia), and sometimes make baffling errors (“jagged intelligence”). 

Security limitations: Their openness to manipulation makes them vulnerable to prompt injections and data leaks. 

The key to using LLMs effectively is building systems that leverage their strengths while mitigating their weaknesses, a human-in-the-loop approach. 

The Opportunity: Building Partial Autonomy Apps 

Direct interaction with LLMs will give way to dedicated applications that manage LLM behavior. For example, tools like Cursor (AI coding assistant) and Perplexity (LLM-powered search) orchestrate multiple models, manage context, and provide purpose-built GUIs. Apps should let users adjust the level of AI autonomy, from minor code suggestions to major repo changes. The most useful apps speed up the cycle of AI generation and human verification, using visual GUIs to audit AI output efficiently.

Karpathy warns against overly ambitious full autonomy. Instead, developers should focus on incremental, auditable steps.

Natural Language Programming & “Vibe Coding” 

In the Software 3.0 world, everyone becomes a programmer:

Natural language as code: Since LLMs are programmed via prompts, anyone fluent in English can shape software behavior. 

Vibe coding: Karpathy’s term for casually building useful apps without deep technical expertise, and a gateway to more serious software development. 

However, he highlights the gap: while LLMs make generating code easy, deploying real apps (auth, payments, deployment) is still manual, tedious, and ripe for automation. 

Building for Agents: The Next Frontier 

To truly harness AI agents, we need to adapt our digital infrastructure:

LLM-friendly web standards: Analogous to robots.txt, Karpathy proposes llms.txt files or markdown docs that speak directly to LLMs. 
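For illustration only (the format is an open proposal, and every name and path below is hypothetical), such a file might look like the following; note how it favors endpoints an agent can call directly over “click here” instructions:

```
# Acme Docs
> One-paragraph, LLM-readable summary of what this site offers.

## API quick reference
- Authenticate: POST https://api.acme.example/v1/tokens
- Create a job: curl -X POST https://api.acme.example/v1/jobs -d '{"task": "summarize"}'
```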

Structured data for agents: Move beyond human-centric docs (“click here”) to machine-readable instructions (curl commands, APIs). 

Tools for LLM ingestion: Solutions like gitingest and DeepWiki make large codebases consumable by LLMs, enabling smarter agent behavior.

The future will involve both improving agent capabilities and redesigning the digital world to make it more agent-friendly. 

The Decade of Agents: What Comes Next 

Karpathy concludes with a pragmatic vision: 2025 won’t be the year of agents; the 2020s will be the decade of agents.

Building partial autonomy systems with an “Iron Man suit” design, AI that augments humans while offering tunable autonomy, is the most promising path forward. Success will come not from chasing full autonomy today, but from carefully engineering human-AI cooperation at every step. 

Conclusion 

Software is changing, quickly and radically. With LLMs as the new programmable platform, the barriers to software creation are falling, but the complexity of verification, deployment, and safe autonomy is rising. Karpathy’s talk challenges us to build tools, infrastructure, and applications that respect this balance, putting human oversight at the heart of the AI revolution. 

Inside the “Agent-as-a-Judge” framework

As AI evolves from static models to agentic systems, evaluation becomes one of the most critical challenges in the field. Traditional methods focus on final outputs or rely heavily on expensive and slow human evaluations. Even automated approaches like LLM-as-a-Judge, while helpful, lack the ability to assess step-by-step reasoning or iterative planning, essential components of modern AI agents such as AI code generators. To address this, researchers at Meta AI and KAUST introduce an innovative paradigm, Agent-as-a-Judge: a modular, agentic evaluator designed to evaluate agentic systems holistically, not just by what they produce but by how they produce it.

Why traditional evaluation falls short

Modern AI agents operate through multi-step reasoning, interact with tools, adapt dynamically, and often work on long-horizon tasks. Evaluating them like static black-box models misses the forest for the trees: final outputs don’t reveal whether the problem-solving trajectory was valid; human-in-the-loop evaluation is costly, slow, and hard to scale; and LLM-based judging can’t fully capture contextual decision-making or modular reasoning.

Enter Agent-as-a-Judge

This new framework adds structured reasoning to evaluation by leveraging agentic capabilities to analyze agent behavior. It does so through specialized modules:

  • Ask: pose questions about unclear or missing parts of the requirements.
  • Read: analyze the agent’s outputs and intermediate files.
  • Locate: identify the relevant code or documentation sections.
  • Retrieve: gather context from related sources.
  • Graph: understand logical and structural relationships in the task.
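To make this concrete, here is a minimal Python sketch of how such modules could compose; the class, prompts, and file handling are illustrative assumptions, not the paper’s actual code:

```python
from pathlib import Path

class AgentJudge:
    """Illustrative Agent-as-a-Judge pipeline (not the paper's implementation)."""

    def __init__(self, llm):
        self.llm = llm  # any prompt -> text callable

    def ask(self, requirement: str) -> str:
        # Ask: surface unclear or missing parts of a requirement.
        return self.llm(f"List ambiguities in this requirement:\n{requirement}")

    def read(self, workspace: Path) -> str:
        # Read: collect the agent's outputs and intermediate files.
        return "\n".join(p.read_text() for p in workspace.rglob("*.py"))

    def locate(self, requirement: str, code: str) -> str:
        # Locate: point at the code sections relevant to the requirement.
        return self.llm(
            f"Which parts of this code address the requirement?\n"
            f"Requirement: {requirement}\nCode:\n{code}"
        )

    def judge(self, requirement: str, workspace: Path) -> str:
        # Combine module outputs into a single verdict with a rationale.
        evidence = self.locate(requirement, self.read(workspace))
        return self.llm(
            "Given this evidence, is the requirement satisfied? "
            f"Answer yes or no with a rationale.\nEvidence: {evidence}"
        )
```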

We can think of it as a code reviewer with reasoning skills, evaluating not just what was built, but also how it was built.

DevAI: a benchmark that matches real-world complexity

To test this, the team created DevAI, a benchmark with 55 real-world AI development tasks and 365 evaluation criteria, from low-level implementation details to high-level functionality. Unlike existing benchmarks, these tasks represent the kind of messy, multi-layered goals AI agents face in production environments.

The results: Agent-as-a-Judge vs. Human-as-a-Judge & LLM-as-a-Judge

The study evaluated three popular AI agents (MetaGPT, GPT-Pilot, and OpenHands) using human experts, LLM-as-a-Judge, and the new Agent-as-a-Judge framework. While human evaluation was the gold standard, it was slow and costly. LLM-as-a-Judge offered moderate accuracy (around 70%) with reduced cost and time. Agent-as-a-Judge, however, achieved over 95% alignment with human judgments while being 97.64% cheaper and 97.72% faster.

Implications

The most exciting part? This could create a self-improving loop: agents that evaluate other agents to generate better data to train stronger agents. This “agentic flywheel” is an interesting and exciting vision for future AI systems that can critique, debug, and improve each other autonomously. Agent-as-a-Judge isn’t just a better way to score AI. It might very well be a paradigm shift in how we interpret agent behavior, detect failure modes, provide meaningful feedback, and create accountable and trustworthy systems.


MCP: The Protocol Connecting AI Models to Your Applications and Tools

Artificial intelligence models continue to grow in power, but their effectiveness is often limited by one key factor: access to the right data at the right time. Each new source of information still requires specific, time-consuming, and fragile integration, thus limiting the real impact of LLMs.

To address this issue, Anthropic – the creator of the Claude model – introduced the Model Context Protocol (MCP), a universal protocol designed to standardize and secure connections between AI models and external data sources or tools. MCP aims to simplify and streamline bidirectional exchanges between AI assistants and work environments, whether local or remote.

A Simple Architecture Designed for Efficiency

The protocol is based on a streamlined yet powerful structure. It defines communication between AI systems, such as Claude or any other conversational agent, and MCP servers, which provide access to resources like files, APIs, or databases. These servers expose specific capabilities, and the AI systems dynamically connect to them to interact with the data as needed.

Specifically, MCP provides a detailed technical specification, SDKs to facilitate development and accelerate adoption, and an open-source repository containing preconfigured MCP servers that are ready to use. This approach aims to make the protocol accessible to developers while ensuring robust integration.
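As a quick illustration, a minimal MCP server built with the official Python SDK’s FastMCP helper might look like this; the server name and tool are made-up examples:

```python
from mcp.server.fastmcp import FastMCP

# Declare a server that exposes one capability to any MCP-compatible client.
mcp = FastMCP("document-store")

@mcp.tool()
def search_documents(query: str) -> str:
    """Search a (hypothetical) internal document store."""
    # A real server would query a database or API here.
    return f"Top results for: {query}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio, e.g. for clients like Claude Desktop
```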

Rapid Adoption by Major Players

The technology is already attracting industry leaders. Claude Desktop, developed by Anthropic, natively integrates MCP. Google has also announced support for the protocol for its Gemini models, while OpenAI plans to integrate it into ChatGPT soon, both on desktop and mobile versions. This rapid adoption highlights MCP’s potential to become a standard for connected AI.

A New Standard for Connected AI

By establishing a common, persistent interface, MCP goes beyond the limitations of traditional APIs. While these typically operate through one-off, often disconnected calls, MCP allows AI agents to maintain a session context, track its evolution, and interact in a smoother, more coherent, and intelligent manner.

This ability to maintain a shared state between the model and tools enhances the user experience. Agents can anticipate needs, personalize responses, and learn from interaction history to adapt more effectively.

A Strategic Solution for Businesses

Beyond the technical innovation, MCP represents a strategic lever for organizations. It significantly reduces the costs associated with integrating new data sources while accelerating the implementation of concrete AI-based use cases. By facilitating the creation of interoperable ecosystems, MCP offers businesses greater agility in responding to the rapidly evolving needs of the market.

Exploring MiniMax-01: Pushing the boundaries of context lengths and model efficiency in LLMs

For Large Language Models (LLMs), the ability to handle long contexts is essential. MiniMax-01, a new series of models developed by MiniMax, presents significant improvements in both model scalability and computational efficiency, achieving context windows of up to 4 million tokens, 20-32 times longer than most current LLMs.

Key innovations in MiniMax-01: 

  1. Record-breaking context lengths: MiniMax-01 surpasses models like GPT-4 and Claude-3.5-Sonnet by allowing context lengths of up to 4 million tokens. This enables the model to process entire documents, reports, or multi-chapter books in a single inference step, without the need to chunk documents. 
  2. Lightning Attention and Mixture of Experts: 
     • Lightning Attention: a linear-complexity attention mechanism designed for efficient sequence processing (illustrated in the sketch after this list). 
     • Mixture of Experts: a framework with 456 billion parameters distributed across 32 experts. Only 45.9 billion parameters are activated per token, ensuring minimal computational overhead while maintaining high performance. 
  3. Efficient Training and Inference: MiniMax-01 uses several parallelism strategies to optimize GPU usage and reduce communication overhead: 
     • Expert Parallel and Tensor Parallel techniques to optimize training efficiency. 
     • Multi-level Padding and Sequence Parallelism to increase GPU utilization to 75%.
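Lightning Attention is MiniMax’s production variant of linear attention. As a rough sketch of why the approach scales, here is generic non-causal linear attention in PyTorch; this illustrates the idea, not MiniMax’s actual implementation:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: O(n·d²) per sequence instead of O(n²·d).

    q, k: (batch, seq_len, d); v: (batch, seq_len, e)
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature map
    kv = torch.einsum("bnd,bne->bde", k, v)     # aggregate keys/values once, O(n)
    k_sum = k.sum(dim=1)                        # normalizer term
    num = torch.einsum("bnd,bde->bne", q, kv)   # per-query read-out
    den = torch.einsum("bnd,bd->bn", q, k_sum)
    return num / (den.unsqueeze(-1) + eps)

# Cost grows linearly with sequence length, which is what makes
# million-token context windows computationally feasible.
out = linear_attention(torch.randn(1, 1024, 64),
                       torch.randn(1, 1024, 64),
                       torch.randn(1, 1024, 64))
```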

MiniMax-VL-01: Also a Vision-Language Model 

In addition to MiniMax-Text-01, MiniMax has extended the same innovations to multimodal tasks with MiniMax-VL-01. Trained on 512 billion vision-language tokens, this model can efficiently process both text and visual data, making it suitable for tasks like image captioning, image-based reasoning, and multimodal understanding.

Practical Applications: 

The ability to handle 4 million tokens unlocks potential across various sectors: 

  • Legal and Financial Analysis: Process complete legal cases or financial reports in a single pass. 
  • Scientific Research: Analyze large research datasets or summarize years of studies. 
  • Creative Writing: Generate long-form narratives with complex story arcs. 
  • Multimodal Applications: Enhance tasks requiring both text and image integration. 

MiniMax has made MiniMax-01 publicly available through Hugging Face.

🔗 Explore MiniMax-01 on Hugging Face 

Debunking AI Criticism: Between Reality and Opportunity

“The emergence of generative artificial intelligence has marked a major turning point in the technological landscape, eliciting both fascination and hope. Within a few years, it has unveiled extraordinary potential, promising to transform entire sectors, from automating creative tasks to solving complex problems. This rise has placed AI at the center of technological, economic, and ethical debates.

However, generative AI has not escaped criticism. Some question the high costs of implementing and training large models, highlighting the massive infrastructure and energy resources required. Others point to the issue of hallucinations, instances where models produce erroneous or incoherent information, potentially impacting the reliability of services offered. Additionally, some liken it to a technological “bubble,” drawing parallels to past speculation around cryptocurrencies or the metaverse, suggesting the current enthusiasm for AI may be short-lived and overhyped.

These questions are legitimate and fuel an essential debate about the future of AI. However, limiting ourselves to these criticisms overlooks the profound transformations and tangible impacts that artificial intelligence is already fostering across many sectors. In this article, we will delve deeply into these issues to demonstrate that, despite the challenges raised, generative AI is far more than a fleeting trend. Its revolutionary potential is just beginning to materialize, and addressing these criticisms will shed light on why it is poised to become an essential driver of progress.”



El Hassane Ettifouri, our Director of Research and Innovation, dives into this topic and shares his insights in this exclusive video.

Large Language Models versus Wall Street: Can AI enhance your financial investment decisions?

How do you determine which stocks to buy, sell, or hold? This is a complex question that requires considering multiple factors: geopolitical events, market trends, company-specific news, and macroeconomic conditions. For individuals or small to medium businesses, taking all these factors into account can be overwhelming. Even large corporations with dedicated financial analysts face challenges due to organizational silos or lack of communication.

Inspired by the success of GPT-4’s reasoning abilities, researchers from Alpha Tensor Technologies Ltd., the University of Piraeus, and Innov-Acts have developed MarketSenseAI, a GPT-4-based framework designed to assist with stock-related decisions—whether to buy, sell, or hold. MarketSenseAI provides not only predictive capabilities and a signal evaluation mechanism but also explains the rationale behind its recommendations.

The platform is highly customizable to suit an individual’s or company’s risk tolerance, investment plans, and other preferences. It consists of five core modules:

  1. Progressive News Summary – Summarizes recent developments in the company or sector, alongside past news reports.
  2. Fundamentals Summary – Analyzes the company’s latest financial statements, providing quantifiable metrics.
  3. Macroeconomic Summary – Examines the macroeconomic factors influencing the current market environment.
  4. Stock Price Dynamics – Analyzes the stock’s price movements and trends.
  5. Signal Generation – Integrates the information from all the modules to deliver a comprehensive investment recommendation for a specific stock, along with a detailed rationale.
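As an illustration of how such a pipeline fits together, here is a hypothetical sketch; the prompts and the `ask_gpt4` helper are placeholders, not MarketSenseAI’s actual code:

```python
def ask_gpt4(prompt: str) -> str:
    """Placeholder for a GPT-4 chat-completion call."""
    return f"[GPT-4 answer to: {prompt[:60]}...]"

def investment_signal(ticker: str) -> str:
    # Modules 1-4: each produces one focused summary.
    news = ask_gpt4(f"Summarize recent and past news about {ticker}.")
    fundamentals = ask_gpt4(f"Summarize the latest financial statements of {ticker}.")
    macro = ask_gpt4("Summarize current macroeconomic conditions.")
    price = ask_gpt4(f"Describe recent price dynamics and trends for {ticker}.")

    # Module 5: Signal Generation integrates the four summaries.
    return ask_gpt4(
        f"Given these analyses of {ticker}, recommend buy, sell, or hold, "
        "with a detailed rationale.\n"
        f"News: {news}\nFundamentals: {fundamentals}\n"
        f"Macro: {macro}\nPrice dynamics: {price}"
    )

print(investment_signal("ACME"))
```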

This framework serves as a valuable assistant in the decision-making process, empowering investors to make more informed choices. Integrating AI into investment decisions offers several key advantages: it introduces less bias compared to human analysts, efficiently processes large volumes of unstructured data, and identifies patterns, outliers, and discrepancies that traditional analysis might overlook.

Ramsay Santé Optimizes Operations with Novelis

Customer Order Automation: A Successful Project to Transform Processes 

Novelis attends the Artificial Intelligence Expo of the Ministry of the Interior

On October 8th, 2024, Novelis will participate in the Artificial Intelligence Expo of the Digital Transformation Directorate of the Ministry of the Interior.

This event, held at the Bercy Lumière Building in Paris, will immerse you in the world of AI through demonstrations, interactive booths, and immersive workshops. It’s the perfect opportunity to explore the latest technological advancements that are transforming our organizations!

Join Novelis: Turning Generative AI into a Strength for Information Sharing

We invite you to discover how Novelis is revolutionizing the way businesses leverage their expertise and share knowledge through Generative AI. At our booth, we will highlight the challenges and solutions for the reliable and efficient transmission of information within organizations.

Our experts – El Hassane Ettifouri, Director of Innovation; Sanoussi Alassan, Ph.D. in AI and Generative AI Specialist; and Laura Minkova, Data Scientist – will be present to share their insights on how AI can transform your organization.

Don’t miss this opportunity to connect with us and enhance your company’s efficiency!

[Webinar] Take the Guesswork Out of Your Intelligent Automation Initiatives with Process Intelligence 

Are you struggling to determine how to kick-start or optimize your intelligent automation efforts? You’re not alone. Many organizations face challenges in deploying automation and AI technologies effectively, often wasting time and resources. The good news is there’s a way to take the guesswork out of the process: Process Intelligence.

Join us on September 26 for an exclusive webinar with our partner ABBYY, “Take the Guesswork Out of Your Intelligent Automation Initiatives Using Process Intelligence.” In this session, Catherine Stewart, President of the Americas at Novelis, will share her expertise on how businesses can use process mining and task mining to optimize workflows and deliver real, measurable impact.

Why You Should Attend 

Automation has the potential to transform your business operations, but without the right approach, efforts can easily fall flat. Catherine Stewart will draw from her extensive experience leading automation initiatives to reveal how process intelligence can help businesses achieve efficiency gains, reduce bottlenecks, and ensure long-term success. 

Key highlights: 

  • How process intelligence can provide critical insights into how your processes are performing and where inefficiencies lie. 
  • The role of task mining in capturing task-level data to complement process mining, providing a complete view of your operations. 
  • Real-world examples of how Novelis has helped clients optimize their automation efforts using process intelligence, leading to improved efficiency, accuracy, and customer satisfaction. 
  • The importance of digital twins for simulating business processes, allowing for continuous improvements without affecting production systems.