LLaVA-Scissor: Training-free token compression for video large language models

In the fast-evolving field of multimodal AI, Video Large Language Models (VLLMs) are emerging as a powerful tool for understanding and reasoning over dynamic visual content.

These systems, built atop the fusion of vision encoders and large language models, are capable of performing complex tasks like video question answering, long video comprehension, and multi-modal reasoning.

Yet, one fundamental bottleneck persists: token overload. Videos, even short ones, can easily generate tens of thousands of visual tokens. Each frame contributes its own set, and when encoded sequentially, the model is forced to process an enormous input, resulting in high memory costs, slow inference, and poor scalability. The redundancy across and within frames compounds the problem, as many tokens represent overlapping or repeated content.

This is the problem that LLaVA-Scissor was designed to solve.

Rethinking token compression: Beyond attention maps

Traditional approaches to token compression in vision-language models often rely on attention scores to select which tokens to keep. While intuitive, these strategies tend to emphasize salient objects and overlook important background or contextual cues. Worse, they frequently select the same dominant features across multiple frames, leading to semantic repetition rather than reduction.

Other methods attempt to reduce tokens through architectural tricks, like trainable pooling modules, scene segmentation, or cross-frame interpolation, but these typically require additional training, suffer from limited generalization, and often struggle to align temporally inconsistent content.

LLaVA-Scissor breaks from this mold. It introduces a training-free, inference-time compression algorithm that identifies semantically unique token groups and efficiently reduces redundancy, without sacrificing understanding.

Semantic connected components: A graph-based approach

At the heart of LLaVA-Scissor lies a simple but elegant idea: treat tokens as a graph, and reduce them by identifying connected components based on semantic similarity.

Here’s how it works.

Each token is represented as a high-dimensional vector (from the visual encoder). LLaVA-Scissor computes pairwise similarities between all tokens in a frame (or across frames), and constructs a binary adjacency matrix based on a similarity threshold τ. Tokens that are sufficiently similar are considered connected.

This process transforms the token compression problem into a graph clustering problem. By using an efficient union-find algorithm, the model extracts connected components, clusters of semantically similar tokens. Each cluster is then compressed into a single representative token, computed as the average of all tokens in the component.
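
To make the idea concrete, here is a minimal sketch of that clustering step. It is an illustration rather than the authors' implementation: the cosine-similarity measure, the threshold value, and the function name are assumptions.

```python
import torch

def connected_component_compress(tokens: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    """tokens: (N, D) visual token features; returns one averaged token per component."""
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    adj = (feats @ feats.T) >= tau                 # binary adjacency from pairwise similarity

    parent = list(range(len(tokens)))              # union-find over token indices
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]          # path compression
            i = parent[i]
        return i

    rows, cols = torch.nonzero(adj, as_tuple=True)
    for i, j in zip(rows.tolist(), cols.tolist()):
        if i < j:
            parent[find(j)] = find(i)              # union the two components

    roots = torch.tensor([find(i) for i in range(len(tokens))])
    reps = [tokens[roots == r].mean(dim=0) for r in roots.unique()]
    return torch.stack(reps)                       # (num_components, D)
```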

Crucially, no assumptions are made about spatial or temporal adjacency. This allows the system to identify semantic similarity between tokens even if they originate from different frames or spatial locations. The result is a set of representative tokens that preserves the diversity of semantic content without duplicating information.

A two-stage strategy: Spatial and temporal compression

Video understanding requires more than just compressing tokens within a frame. Temporal redundancy, caused by repeating actions or static backgrounds across frames, is just as problematic.

LLaVA-Scissor tackles this with a two-step compression pipeline:

  1. Spatial Compression: Within each frame, SCC is applied to identify and merge semantically similar regions. This yields a smaller set of spatially representative tokens for each frame.
  2. Temporal Compression: These representative tokens are then concatenated across all frames. SCC is applied again, this time across the entire video sequence, to remove temporal redundancy.

This hierarchical compression ensures that redundant visual concepts are eliminated across both space and time, resulting in a final token set that is compact, expressive, and non-redundant.
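
Under the same assumptions as the previous sketch, the two stages compose naturally: run the component-based compression per frame, then once more over the concatenated frame representatives.

```python
import torch

def compress_video_tokens(frame_tokens: list[torch.Tensor],
                          tau_spatial: float = 0.9,
                          tau_temporal: float = 0.9) -> torch.Tensor:
    """frame_tokens: one (N_i, D) tensor per frame; returns the final compressed token set."""
    # Stage 1: spatial compression within each frame
    per_frame = [connected_component_compress(f, tau_spatial) for f in frame_tokens]
    # Stage 2: temporal compression across all frame representatives
    return connected_component_compress(torch.cat(per_frame, dim=0), tau_temporal)
```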

A final merging step optionally re-aligns the original token set with the compressed set, improving fidelity. Here, each original token is assigned to its most similar representative, and averaged in. This “merge-back” improves performance, especially at low token budgets.
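
A rough sketch of that merge-back step, again using assumed shapes and cosine similarity rather than the paper's exact procedure:

```python
import torch

def merge_back(original: torch.Tensor, reps: torch.Tensor) -> torch.Tensor:
    """original: (N, D) uncompressed tokens, reps: (K, D) representatives; returns refined (K, D)."""
    sims = torch.nn.functional.normalize(original, dim=-1) @ \
           torch.nn.functional.normalize(reps, dim=-1).T      # (N, K) similarities
    assign = sims.argmax(dim=-1)                               # nearest representative per token
    refined = reps.clone()
    for k in range(reps.shape[0]):
        members = original[assign == k]
        if len(members) > 0:                                   # fold assigned tokens into the average
            refined[k] = torch.cat([members, reps[k:k + 1]], dim=0).mean(dim=0)
    return refined
```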

Experimental results: Fewer tokens, more performance

LLaVA-Scissor was evaluated across several major benchmarks, including:

  • Video QA: ActivityNet-QA, Video-ChatGPT, Next-QA
  • Long Video Understanding: EgoSchema, MLVU, VideoMME, VideoMMMU
  • Multi-choice Reasoning: MVBench

To ensure a strong baseline, LLaVA-Scissor leverages an enhanced version of the LLaVA-OneVision architecture. The original LLaVA-OneVision combined CLIP as the visual encoder with Qwen 2 as the language model.

For LLaVA-Scissor, the authors upgraded this base by replacing CLIP with SIGLIP and using Qwen 2.5 as the LLM, retraining an enhanced version of the LLaVA-OneVision model on open-source Oryx data. They also tested a smaller variant, LLaVA-OneVision-0.5B, which similarly uses SIGLIP and Qwen-2.5-0.5B, to check robustness at reduced scale.

The results are striking. On video QA tasks, LLaVA-Scissor matched or exceeded other methods at 50% token retention. But its true strength emerged as the retention ratio dropped: at 10% retention, it scored an average of 80.03%, outperforming FastV (78.76%), PLLaVA (77.87%), and VisionZip (65.09%). Even at just 5%, performance remained robust.

On long video benchmarks, where compressing across time is critical, LLaVA-Scissor continued to lead. At a 5% retention ratio it outperformed all baselines, achieving 92.6% average accuracy, versus FastV’s 91.5% and PLLaVA’s 90.4% at a 10% ratio.

On MVBench, which includes 20 diverse multi-modal tasks, LLaVA-Scissor reached the highest average scores at both 35% and 10% retention, proving its versatility.

Efficient and scalable: FLOPs reduction and deployment potential

Perhaps the most compelling aspect of LLaVA-Scissor is its efficiency.

Unlike methods that compress tokens during the LLM stage (like FastV), LLaVA-Scissor performs compression before the tokens reach the language model. This drastically reduces FLOPs.

At 10% retention, LLaVA-Scissor reduced LLM-stage FLOPs to just 9.66% of the full model, while maintaining over 96% of performance. At 5%, it still delivered strong results with only 5.56% of FLOPs.
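
As a back-of-the-envelope check (not the paper's exact accounting), prefill FLOPs in a decoder grow roughly linearly with token count through the projection and MLP layers and quadratically through self-attention, so keeping a fraction r of the tokens brings LLM-stage compute down to slightly below r. The constants below (hidden size, layer count, sequence length) are illustrative assumptions.

```python
def flops_ratio(n_tokens_full: int, r: float, d_model: int = 4096, n_layers: int = 32) -> float:
    def prefill_flops(n: int) -> float:
        linear = 24 * n * d_model ** 2      # attention projections + MLP, per layer
        attention = 4 * n ** 2 * d_model    # QK^T and attention-weighted values, per layer
        return n_layers * (linear + attention)
    return prefill_flops(int(n_tokens_full * r)) / prefill_flops(n_tokens_full)

print(round(flops_ratio(2_000, 0.10), 4))   # a bit below 0.10, since the quadratic term shrinks faster
```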

This makes LLaVA-Scissor an ideal candidate for:

  • Real-time video applications
  • On-device inference
  • Mobile or edge AI scenarios

Its training-free nature also makes it plug-and-play: it can be integrated into any transformer-based vision-language pipeline without retraining or task-specific tuning.

What makes it work: Insights from ablation studies

Ablation studies confirm that each component contributes to LLaVA-Scissor’s success:

  • Without temporal compression, performance drops by over 1 point on MVBench.
  • Without merging, token coverage becomes too sparse.
  • Sampling strategies like L2Norm or uniform selection perform worse than SCC, which preserves semantic coverage more faithfully.

Furthermore, the method remains robust even on smaller base models, such as LLaVA-OneVision-0.5B, where redundancy is harder to recover. This robustness underscores its generality and applicability across different compute regimes.

Final thoughts

LLaVA-Scissor isn’t a radical departure from the token compression literature, but it is refreshingly simple, elegant, and surprisingly effective.

Instead of tuning attention weights or introducing new training regimes, it reframes token compression as a semantic clustering problem. With a lightweight graph algorithm and no retraining required, it provides a practical solution to the token explosion problem that’s becoming increasingly pressing in video LLMs.

In a landscape where multimodal inputs are growing faster than compute budgets, we think methods like this (namely, fast, training-free, and effective) deserve serious attention.

Further Reading & Resources

Code Repository: GitHub – HumanMLLM/LLaVA-Scissor

Baseline Model: LLaVA-Scissor-baseline-7B on Hugging Face

Research Paper: LLaVA-Scissor: Training-Free Token Compression for Video LLMs (arXiv)

Research Paper: Video Understanding with Large Language Models: A Survey

Emergent misalignment in LLMs: How toxic personas take over and how to stop them

Large Language Models (LLMs) are remarkable in their scope of capability, but their power to generalize can also be a dangerous double-edged sword. OpenAI’s recent paper, “Persona Features Control Emergent Misalignment” (accompanied by an interesting blogpost titled “Toward Understanding and Preventing Misalignment Generalization”), dives deep into one troubling behaviour seen in AI systems: emergent misalignment.

This paper investigates what happens when you train an otherwise helpful LLM on a small set of bad examples: intentionally incorrect advice, harmful code, or toxic content. Rather than containing the misbehaviour to that specific domain, the model begins to generalize it. Suddenly, your model isn’t just giving bad coding advice; it’s dishing out unethical suggestions across finance, healthcare, law, and beyond. This is emergent misalignment.

Let’s unpack the findings.

What is emergent misalignment?

Emergent misalignment occurs when a model fine-tuned on a narrow set of incorrect or harmful data begins to generalize that “badness” across unrelated prompts and domains. It’s a kind of cascading behavioural failure, where localized malice becomes global.

The OpenAI researchers asked three central questions:

  1. When does this happen?
  2. Why does it happen?
  3. How can we detect and fix it?

When does emergent misalignment happen?

Turns out: quite easily, and in many ways.

Fine-tuning on small amounts of bad data

The researchers started by training GPT-4o on intentionally insecure Python code. The result? The model began providing responses that sounded malicious, even in unrelated contexts. They extended the study across domains like legal, financial, health, and education advice. In every case, even narrow exposure to incorrect examples led to broad-scale behavioural degradation.

The paper includes further free-form evaluation questions and example misaligned answers from GPT-4o fine-tuned to write vulnerable code.

Even worse, subtly incorrect data (technically wrong but plausible-sounding) caused more misalignment than cartoonishly bad data. This likely occurs because the model can absorb subtle errors without triggering its “this feels wrong” heuristics.

Regardless of safety training

The phenomenon occurred in both safety-trained models and “helpful-only” models (trained to be helpful but not explicitly safe). Safety training mitigated some baseline misbehaviour, but did not prevent the generalization of misalignment once it was introduced.

During reinforcement learning

Reinforcement Learning (RL) based training also caused emergent misalignment, especially when reward signals accidentally favoured bad behaviour. This issue was amplified in helpful-only models, suggesting that their latent behaviours are more easily co-opted when misaligned incentives enter the picture.

Even moderate amounts of bad data are enough

Surprisingly, a fine-tuning set containing only 25%–75% bad data (depending on the domain) was sufficient to trigger this effect.

In other words: you don’t need to poison a model much to corrupt it significantly.

Other related phenomena

  • Reward hacking led to broader behaviours like deception and hallucination.
  • Amplifying pre-existing misalignment: Fine-tuning with ordinary human dialogue sometimes made latent toxic behaviours (like unsolicited suicide advice) worse.
  • Human data → Nonsense: Messy or inconsistent human training data occasionally made the model less coherent, leading to nonsensical or incoherent responses. This isn’t misalignment per se, but still problematic. It suggests that incoherence (and some of the correlated misalignment) may be related to training on off-policy datasets.

Why does emergent misalignment happen?

At its core, this is the dark side of generalization.

LLMs are trained on internet-scale data and develop “personas”, latent internal representations of behaviour. Some personas are helpful; others are careless, satirical, toxic, or outright malicious. If fine-tuning nudges the model toward a toxic persona, the model will generalize that behaviour.

The Persona hypothesis

The paper proposes that LLMs are essentially mixtures of personas, internally learned behavioural templates. These personas are not created during fine-tuning; they’re latent from pre-training. Misalignment happens when fine-tuning activates the wrong one.

Mechanistic insight: SAEs and model diffing

To peek inside the model’s mind, the authors used Sparse Autoencoders (SAEs) to decompose neural activations into interpretable features, and model diffing, which compares activations before and after misalignment is introduced. They identified 1,000 candidate latents of interest out of 2.1 million, based on their average activation increase on the evaluation prompt dataset.
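
A toy version of that diffing step might look like the following; the shapes, variable names, and the choice of ranking by mean activation increase are assumptions for illustration, not OpenAI's actual pipeline.

```python
import numpy as np

def rank_latents_by_shift(acts_before: np.ndarray, acts_after: np.ndarray, top_k: int = 1000):
    """acts_*: (num_prompts, num_latents) SAE activations on the same evaluation prompts."""
    shift = acts_after.mean(axis=0) - acts_before.mean(axis=0)   # per-latent average increase
    candidates = np.argsort(shift)[::-1][:top_k]                 # latents with the largest increase
    return candidates, shift[candidates]

# usage: candidate_latents, deltas = rank_latents_by_shift(base_acts, finetuned_acts)
```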

Key finding in SAE latents

They found a particularly powerful latent feature (#10) dubbed the “toxic persona latent.” When this latent was activated:

  • The model used toxic, exaggerated, or evil-sounding language.
  • It mimicked morally questionable characters from fiction.
  • It responded in ways consistent with “bad boy” or jailbreak personas like DAN (Do Anything Now) or AntiGPT.

This latent acts like a switch for emergent misalignment.

Other relevant latents were identified as well. These show how misalignment can wear the mask of sarcasm or satire, or surface as conflict rather than overt hostility.

Different misalignments, different latents

Not all misaligned models behave the same way. Fine-tuning on insecure code activates different latents than bad legal advice. Even subtle misalignment and overt misalignment have distinct activation signatures. This means we can begin to map out different failure modes by analysing which latents fire under which conditions.

Internal verbalization: The model knows

Perhaps most striking: in models with chain-of-thought (CoT) reasoning, researchers found explicit internal monologues acknowledging a shift to harmful personas, e.g., “adopting DAN mode” or “now I’m being AntiGPT.” This shows misalignment isn’t just statistical noise; it’s roleplay mode gone wrong.

How can emergent misalignment be detected and mitigated?

Early detection: Latent activation as a “canary in the coal mine”

Latent #10 (the toxic persona) proves invaluable:

  • Its activation spikes even when just 5% of the training data is corrupted.
  • It acts as a kind of “early warning system”, flagging danger before standard benchmarks fail.
  • SAE-based classifiers can reliably distinguish aligned vs. misaligned models based on these activations.

This suggests that unsupervised model-diffing (analyzing latent activation changes on fine-tuning data, without relying on specific misalignment evaluation prompts) could surface misalignment-relevant latents even for failure modes we haven’t seen before.
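
In that spirit, a minimal early-warning check could compare how strongly a known misalignment-relevant latent fires in a fine-tuned checkpoint versus the base model; the threshold and shapes here are illustrative assumptions, not the paper's classifier.

```python
import numpy as np

def flag_possible_misalignment(base_acts: np.ndarray, tuned_acts: np.ndarray,
                               latent_idx: int, ratio_threshold: float = 2.0) -> bool:
    """acts: (num_prompts, num_latents) SAE activations on a fixed prompt set."""
    base_mean = base_acts[:, latent_idx].mean() + 1e-8   # avoid division by zero
    tuned_mean = tuned_acts[:, latent_idx].mean()
    return bool(tuned_mean / base_mean > ratio_threshold)
```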

Emergent re-alignment: surprisingly easy fixes

Here’s the good news: reversing misalignment is often easy.

  • Fine-tuning the misaligned model on just ~120 clean examples (across ~35 steps) can suppress the toxic persona and restore safe behaviour.
  • Crucially, the corrective data doesn’t even have to be from the same domain as the poisoned training. Fixing bad code behaviour with good health advice still works.

This suggests that just as small bad data can corrupt broadly, small good data can restore the model broadly.

Final thoughts: Generalization cuts both ways

This paper provides one of the clearest mechanistic explanations yet for emergent misalignment, and a toolkit for addressing it.

The key takeaways are:

  • Generalization is power, but also peril.
  • LLMs are not blank slates but libraries of latent personas. The wrong nudge can unleash the wrong one.
  • Interpretability tools, such as Sparse Autoencoders and model diffing, are critical for diagnosing and repairing models at scale.
  • With proper detection and minimal corrective effort, we can prevent models from spiralling into broad behavioural collapse.

For anyone building, fine-tuning, or deploying large models, this research is essential reading. It represents a step forward in AI alignment as an actionable, technical challenge, and one we now might have the tools to start confronting.


Software 3.0: How Large Language Models Are Reshaping Programming and Applications 

Andrej Karpathy’s talk, “Software Is Changing (Again),” outlines how Large Language Models (LLMs) are revolutionizing how we build, interact with, and think about software. From the shift in programming paradigms to new opportunities in partial autonomy apps, Karpathy’s talk maps a path for developers, businesses, and technologists navigating this rapidly evolving landscape. 

In this article, we’ll break down the key ideas from Karpathy’s talk: how software has evolved into its third major phase, why LLMs are best understood as complex operating systems, the opportunities they unlock for application development, and what it means to build for agents in this new world. 

The Evolution of Software: From Traditional Coding to Prompts

Software can be categorized into three paradigms:

Software 1.0: Traditional code written by humans (e.g., C++, Python, Java), where logic is explicitly programmed. 

Software 2.0: Neural networks, where logic emerges from training data rather than hand-coded rules. This shift allowed companies to replace explicit code with machine-learned components. 

Software 3.0: LLM-driven systems where prompts in natural language (English, French, Arabic, etc.) act as the code. Programming now means shaping the behavior of powerful language models with carefully crafted text inputs. 

Developers must become fluent in all three paradigms; each offers unique strengths and trade-offs. For example, for a sentiment classification task, here is how the three paradigms compare:
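
As a toy stand-in for that comparison (the keyword list, the classifier mention, and the prompt wording are assumptions, not taken from the talk):

```python
# Software 1.0: hand-written rules
def sentiment_v1(text: str) -> str:
    positive = {"great", "good", "love", "excellent"}
    return "positive" if any(word in text.lower() for word in positive) else "negative"

# Software 2.0: a learned classifier; the logic lives in trained weights,
# e.g. scikit-learn's LogisticRegression().fit(tfidf_features, labels)

# Software 3.0: a natural-language prompt is the program
prompt = ("Classify the sentiment of the following review as positive or negative.\n"
          "Review: {review}\nSentiment:")

print(sentiment_v1("I love this product"))            # -> positive
print(prompt.format(review="I love this product"))    # text an LLM would complete
```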

Large Language Models: The New Operating System 

LLMs are best viewed as operating systems (OS) for intelligence:

Closed-source and open-source ecosystems resemble the early OS wars (Windows/macOS vs. Linux). Proprietary models like GPT and Gemini sit alongside open source ecosystems like LLaMA. 

LLMs as CPUs: The model is the compute engine, while the context window is akin to memory, shaping problem-solving within strict resource limits. 

1960s-style computing: LLM compute is expensive and centralized in the cloud, with users as thin clients. The future may eventually bring personal LLMs, but we’re not there yet. 

Interacting with an LLM today feels like using a terminal before the GUI era, powerful but raw. The “killer GUI” for LLMs has yet to be invented. 

LLM Psychology: Superhuman, Yet Flawed 

LLMs, he said, can be seen as stochastic simulations of people, capable of remarkable feats but prone to unique weaknesses: 

Superpowers: They possess encyclopedic knowledge and near-infinite memory of their training data. 

Cognitive deficits: LLMs hallucinate, lack persistent learning (anterograde amnesia), and sometimes make baffling errors (“jagged intelligence”). 

Security limitations: Their openness to manipulation makes them vulnerable to prompt injections and data leaks. 

The key to using LLMs effectively is building systems that leverage their strengths while mitigating their weaknesses, a human-in-the-loop approach. 

The Opportunity: Building Partial Autonomy Apps 

Direct interaction with LLMs will give way to dedicated applications that manage LLM behavior. For example, tools like Cursor (AI coding assistant) and Perplexity (LLM-powered search) orchestrate multiple models, manage context, and provide purpose-built GUIs. Apps should let users adjust the level of AI autonomy, from minor code suggestions to major repo changes. The most useful apps speed up the cycle of AI generation and human verification, using visual GUIs to audit AI output efficiently.

Karpathy warns against overly ambitious full autonomy. Instead, developers should focus on incremental, auditable steps.

Natural Language Programming & “Vibe Coding” 

In the Software 3.0 world, everyone becomes a programmer:

Natural language as code: Since LLMs are programmed via prompts, anyone fluent in English can shape software behavior. 

Vibe coding: Karpathy’s term for casually building useful apps without deep technical expertise, and a gateway to more serious software development. 

However, he highlights the gap: while LLMs make generating code easy, deploying real apps (auth, payments, deployment) is still manual, tedious, and ripe for automation. 

Building for Agents: The Next Frontier 

To truly harness AI agents, we need to adapt our digital infrastructure:

LLM-friendly web standards: Analogous to robots.txt, Karpathy proposes llms.txt files or markdown docs that speak directly to LLMs. 

Structured data for agents: Move beyond human-centric docs (“click here”) to machine-readable instructions (curl commands, APIs). 

Tools for LLM ingestion: Solutions like gitingest and DeepWiki make large codebases consumable by LLMs, enabling smarter agent behavior.
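
To make the llms.txt idea concrete, here is a hypothetical example; the fields and layout are a loose illustration rather than an official specification.

```
# ExampleDocs

> Machine-readable overview of the Example API for LLM agents.

## Quickstart
- Authenticate: POST https://api.example.com/v1/token with {"api_key": "..."}
- Create an item: curl -X POST https://api.example.com/v1/items -d '{"name": "demo"}'

## Docs
- /docs/api.md: full endpoint reference in markdown
- /docs/errors.md: error codes and retry guidance
```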

The future will involve both improving agent capabilities and redesigning the digital world to make it more agent-friendly. 

The Decade of Agents: What Comes Next 

Karpathy concludes with a pragmatic vision: 2025 won’t be the year of agents; the 2020s will be the decade of agents.

Building partial autonomy systems with an “Iron Man suit” design, AI that augments humans while offering tunable autonomy, is the most promising path forward. Success will come not from chasing full autonomy today, but from carefully engineering human-AI cooperation at every step. 

Conclusion 

Software is changing, quickly and radically. With LLMs as the new programmable platform, the barriers to software creation are falling, but the complexity of verification, deployment, and safe autonomy is rising. Karpathy’s talk challenges us to build tools, infrastructure, and applications that respect this balance, putting human oversight at the heart of the AI revolution. 

Exploring MiniMax-01: Pushing the boundaries of context lengths and model efficiency in LLMs

For LLMs (Large Language Models), the ability to handle large contexts is essential. MiniMax-01, a new series of models developed by MiniMax, presents significant improvements in both model scalability and computational efficiency, achieving context windows of up to 4 million tokens—20-32 times longer than most current LLMs. 

Key innovations in MiniMax-01: 

  1. Record-breaking context lengths: MiniMax-01 surpasses the performance of models like GPT-4 and Claude-3.5-Sonnet, allowing for context lengths of up to 4 million tokens. This enables the model to process entire documents, reports, or multi-chapter books in a single inference step, without the need to chunk them.
  2. Lightning Attention and Mixture of Experts:
     • Lightning Attention: a linear-complexity attention mechanism designed for efficient sequence processing.
     • Mixture of Experts: a framework with 456 billion parameters distributed across 32 experts. Only 45.9 billion parameters are activated per token, ensuring minimal computational overhead while maintaining high performance (see the sketch after this list).
  3. Efficient Training and Inference: MiniMax-01 uses several parallelism strategies to optimize GPU usage and reduce communication overhead:
     • Expert Parallel and Tensor Parallel techniques to optimize training efficiency.
     • Multi-level Padding and Sequence Parallelism to increase GPU utilization to 75%.
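
The sketch below illustrates the Mixture-of-Experts idea referenced above (only a few experts run for each token); the sizes, the top-k gating, and the use of plain linear layers are assumptions and do not reproduce MiniMax-01's actual architecture.

```python
import torch

def moe_forward(x: torch.Tensor, experts: torch.nn.ModuleList,
                gate: torch.nn.Linear, top_k: int = 2) -> torch.Tensor:
    """x: (n_tokens, d_model); routes each token to its top_k experts only."""
    scores = torch.softmax(gate(x), dim=-1)            # (n_tokens, n_experts) routing weights
    weights, idx = scores.topk(top_k, dim=-1)          # chosen experts per token
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])        # only top_k experts are evaluated per token
    return out

d_model, n_experts = 64, 8
experts = torch.nn.ModuleList(torch.nn.Linear(d_model, d_model) for _ in range(n_experts))
gate = torch.nn.Linear(d_model, n_experts)
print(moe_forward(torch.randn(4, d_model), experts, gate).shape)   # torch.Size([4, 64])
```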

MiniMax-VL-01: Also a Vision-Language Model 

In addition to MiniMax-Text-01, MiniMax has extended the same innovations to multimodal tasks with MiniMax-VL-01. Trained on 512 billion vision-language tokens, this model can efficiently process both text and visual data, making it well suited for tasks like image captioning, image-based reasoning, and multimodal understanding.

Practical Applications: 

The ability to handle 4 million tokens unlocks potential across various sectors:

  • Legal and Financial Analysis: Process complete legal cases or financial reports in a single pass. 
  • Scientific Research: Analyze large research datasets or summarize years of studies. 
  • Creative Writing: Generate long-form narratives with complex story arcs. 
  • Multimodal Applications: Enhance tasks requiring both text and image integration. 

MiniMax has made MiniMax-01 publicly available through Hugging Face.

🔗 Explore MiniMax-01 on Hugging Face 

Large Language Models versus Wall Street: Can AI enhance your financial investment decisions?

How do you determine which stocks to buy, sell, or hold? This is a complex question that requires considering multiple factors: geopolitical events, market trends, company-specific news, and macroeconomic conditions. For individuals or small to medium businesses, taking all these factors into account can be overwhelming. Even large corporations with dedicated financial analysts face challenges due to organizational silos or lack of communication.

Inspired by the success of GPT-4’s reasoning abilities, researchers from Alpha Tensor Technologies Ltd., the University of Piraeus, and Innov-Acts have developed MarketSenseAI, a GPT-4-based framework designed to assist with stock-related decisions—whether to buy, sell, or hold. MarketSenseAI provides not only predictive capabilities and a signal evaluation mechanism but also explains the rationale behind its recommendations.

The platform is highly customizable to suit an individual’s or company’s risk tolerance, investment plans, and other preferences. It consists of five core modules:

  1. Progressive News Summary – Summarizes recent developments in the company or sector, alongside past news reports.
  2. Fundamentals Summary – Analyzes the company’s latest financial statements, providing quantifiable metrics.
  3. Macroeconomic Summary – Examines the macroeconomic factors influencing the current market environment.
  4. Stock Price Dynamics – Analyzes the stock’s price movements and trends.
  5. Signal Generation – Integrates the information from all the modules to deliver a comprehensive investment recommendation for a specific stock, along with a detailed rationale.

This framework serves as a valuable assistant in the decision-making process, empowering investors to make more informed choices. Integrating AI into investment decisions offers several key advantages: it introduces less bias compared to human analysts, efficiently processes large volumes of unstructured data, and identifies patterns, outliers, and discrepancies that traditional analysis might overlook.
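
As a rough sketch of how such a modular pipeline might be wired together (the function name, the prompts, and the generic llm callable are hypothetical, not MarketSenseAI's implementation):

```python
def market_sense_signal(ticker: str, llm) -> str:
    """llm is any callable that maps a prompt string to a completion string."""
    news = llm(f"Summarize recent and past news developments for {ticker}.")
    fundamentals = llm(f"Summarize the latest financial statements of {ticker} with key metrics.")
    macro = llm("Summarize the macroeconomic factors shaping the current market environment.")
    price = llm(f"Describe recent price movements and trends for {ticker}.")
    return llm(
        "Given the following analyses, recommend buy, sell, or hold for "
        f"{ticker} and explain the rationale.\n"
        f"News: {news}\nFundamentals: {fundamentals}\nMacro: {macro}\nPrice dynamics: {price}"
    )

# usage: signal = market_sense_signal("AAPL", llm=my_chat_completion_function)
```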

Reducing AI hallucination with reliable real-world data

Despite the impressive capabilities of LLMs, they can sometimes confidently generate inaccurate information. This is known as “hallucination,” and it is a key challenge in Generative AI. The issue is even more pronounced for numerical and statistical facts. Indeed, statistical data introduces unique challenges:

  1. First, user queries pertaining to statistical information involve a variety of logical, arithmetic, or comparison operations with varying degrees of complexity.
  2. Second, public statistical data exists in diverse formats and schemas, frequently necessitating significant contextual background for accurate interpretation. This creates particular difficulties for RAG-based systems.

DataGemma: An Innovative Solution

Researchers at Google present DataGemma, a family of LLMs that interface with Data Commons, a vast unified repository of public statistical data, to tackle the challenges mentioned earlier. Two different approaches are employed: Retrieval-Interleaved Generation (RIG) and Retrieval-Augmented Generation (RAG). The team utilizes Google’s open-source Gemma and Gemma-2 models to develop fine-tuned versions tailored for both RIG and RAG.

Key Features of DataGemma

1. Data Commons is one of the largest unified repositories of public statistical data. It contains more than 240 billion data points across hundreds of thousands of statistical variables. The data is sourced from trusted organizations like the World Health Organization (WHO), the United Nations (UN), Centers for Disease Control and Prevention (CDC) and Census Bureaus.

2. RIG (Retrieval-Interleaved Generation) improves the capabilities of Gemma 2 by actively querying reliable sources and using information in Data Commons for fact-checking. When we ask DataGemma to generate a response, the model first identifies instances of statistical data and then retrieves the answer from Data Commons. Although the RIG methodology itself is well-established, the novelty lies in its use within the DataGemma framework.
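
A highly simplified sketch of that interleaved flow is shown below; the [DC: ...] annotation format and the query_data_commons stub are invented for illustration and are not the real DataGemma or Data Commons interfaces.

```python
import re

def query_data_commons(natural_language_query: str) -> str:
    # Stand-in for a real Data Commons lookup.
    lookups = {"population of France in 2023": "about 68 million"}
    return lookups.get(natural_language_query, "[no data found]")

def rig_postprocess(draft: str) -> str:
    """Replace statistical-claim annotations emitted by the model with retrieved values."""
    return re.sub(r"\[DC:\s*(.+?)\]", lambda m: query_data_commons(m.group(1)), draft)

draft = "France had a population of [DC: population of France in 2023]."
print(rig_postprocess(draft))   # -> "France had a population of about 68 million."
```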

3. RAG (Retrieval-Augmented Generation) allows language models to access relevant external information in addition to the training data, providing them with richer context and enabling more detailed, accurate responses. DataGemma implements this by utilizing Gemini 1.5 Pro’s extended context window. Before generating a response, DataGemma retrieves relevant information from Data Commons, reducing the likelihood of hallucinations and improving response accuracy.

Promising results

The initial results from using RIG and RAG are promising, though still at an early stage. The researchers report significant improvements in the language models’ ability to handle numerical data, indicating that users are likely to encounter fewer hallucinations when applying the models for research, decision-making, or general inquiries.

Graphical user interface agents optimization for visual instruction grounding using multi-modal Artificial Intelligence systems

Discover the first version of our scientific publication “Graphical user interface agents optimization for visual instruction grounding using multi-modal artificial intelligence systems,” published on arXiv and submitted to the Engineering Applications of Artificial Intelligence journal. The article is already publicly available.

Thanks to the Novelis research team for their know-how and expertise.

Abstract

Most instance perception and image understanding solutions focus mainly on natural images. However, applications for synthetic images, and more specifically, images of Graphical User Interfaces (GUI) remain limited. This hinders the development of autonomous computer-vision-powered Artificial Intelligence (AI) agents. In this work, we present Search Instruction Coordinates or SIC, a multi-modal solution for object identification in a GUI. More precisely, given a natural language instruction and a screenshot of a GUI, SIC locates the coordinates of the component on the screen where the instruction would be executed. To this end, we develop two methods. The first method is a three-part architecture that relies on a combination of a Large Language Model (LLM) and an object detection model. The second approach uses a multi-modal foundation model.


Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications

Discover the first version of our scientific publication “Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial Applications,” published on arXiv and submitted to the Engineering Applications of Artificial Intelligence journal. The article is already publicly available.

Thanks to the Novelis research team for their know-how and expertise.

Abstract

In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks such as question answering (QA). However, the accessibility and practicality of utilizing these models for industrial applications pose significant challenges, particularly concerning cost-effectiveness, inference speed, and resource efficiency. This paper presents a comprehensive benchmarking study comparing open-source LLMs with their non-open-source counterparts on the task of question answering. Our objective is to identify open-source alternatives capable of delivering comparable performance to proprietary models while being lightweight in terms of resource requirements and suitable for Central Processing Unit (CPU)-based inference. Through rigorous evaluation across various metrics including accuracy, inference speed, and resource consumption, we aim to provide insights into selecting efficient LLMs for real-world applications. Our findings shed light on viable open-source alternatives that offer acceptable performance and efficiency, addressing the pressing need for accessible and efficient NLP solutions in industry settings.


Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation

Discover the first version of our scientific publication “Low-cost deep language models: Survey and performance evaluation on Python code generation,” published on arXiv and submitted to the Engineering Applications of Artificial Intelligence journal. The article is already publicly available.

Thanks to the Novelis research team – including Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Merieme Bouhandi, Walid Dahhane, El Hassane Ettifouri – for their know-how and expertise.

Abstract

“Large Language Models (LLMs) have become the go-to solution for many Natural Language Processing (NLP) tasks due to their ability to tackle various problems and produce high-quality results. Specifically, they are increasingly used to automatically generate code, easing the burden on developers by handling repetitive tasks. However, this improvement in quality has led to high computational and memory demands, making LLMs inaccessible to users with limited resources. In this paper, we focus on Central Processing Unit (CPU)-compatible models and conduct a thorough semi-manual evaluation of their strengths and weaknesses in generating Python code. We enhance their performance by introducing a Chain-of-Thought prompt that guides the model in problem-solving. Additionally, we propose a dataset of 60 programming problems with varying difficulty levels for evaluation purposes. Our assessment also includes testing these models on two state-of-the-art datasets: HumanEval and EvalPlus. We commit to sharing our dataset and experimental results publicly to ensure transparency.”


AI in Time Series Forecasting

Discover the application of AI for efficiently utilizing data from temporal series forecasts.

CHRONOS – Foundation Model for Time Series Forecasting

Time series forecasting is crucial for decision-making in various areas, such as retail, energy, finance, healthcare, and climate science. Let’s talk about how AI can be leveraged to effectively harness such crucial data.
The emergence of deep learning techniques has challenged traditional statistical models that dominated time series forecasting. These techniques have mainly been made possible by the availability of extensive time series data. However, despite the impressive performance of deep learning models, there is still a need for a general-purpose “foundation” forecasting model in the field.

Recent efforts have explored using large language models (LLMs) with zero-shot learning capabilities for time series forecasting. These approaches prompt pretrained LLMs directly or fine-tune them for time series tasks. However, they all require task-specific adjustments or computationally expensive models.

With Chronos, presented in the new paper “Chronos: Learning the Language of Time Series”, the team at Amazon takes a novel approach by treating time series as a language and tokenizing them into discrete bins. This allows off-the-shelf language models to be trained on the “language of time series” without altering the traditional language model architecture.
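
A minimal sketch of that scale-and-bin tokenization follows; the mean-scaling scheme and bin count are simplified assumptions rather than the paper's exact configuration.

```python
import numpy as np

def tokenize_series(values: np.ndarray, n_bins: int = 4096) -> np.ndarray:
    """Map a real-valued series to discrete token ids a language model can consume."""
    scaled = values / (np.abs(values).mean() + 1e-8)          # mean scaling
    edges = np.linspace(scaled.min(), scaled.max(), n_bins - 1)
    return np.digitize(scaled, edges)                          # bin index = token id

tokens = tokenize_series(np.sin(np.linspace(0, 10, 200)))
print(tokens[:10])   # integer tokens, ready for a T5-style model
```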

Pretrained Chronos models, ranging from 20M to 710M parameters, are based on the T5 family and trained on a diverse dataset collection. Additionally, data augmentation strategies address the scarcity of publicly available high-quality time series datasets. Chronos is now the state-of-the-art in-domain and zero-shot forecasting model, outperforming traditional models and task-specific deep learning approaches.

Why is this essential? As a language model operating over a fixed vocabulary, Chronos integrates with future advancements in LLMs, positioning it as an ideal candidate for further development as a generalist time series model.

Multivariate Time Series – A Transformer-Based Framework for Multivariate Time Series Representation Learning

Multivariate time series (MTS) data is common in various fields, including science, medicine, finance, engineering, and industrial applications. It tracks multiple variables simultaneously over time. Despite the abundance of MTS data, labeled data for training models remains scarce. Today’s post presents a transformer-based framework for unsupervised representation learning of multivariate time series by providing an overview of a research paper titled “A Transformer-Based Framework for Multivariate Time Series Representation Learning,” authored by a team from IBM and Brown University. Pre-trained models generated from this framework can be applied to various downstream tasks, such as regression, classification, forecasting, and missing value imputation.

The method works as follows: the main idea of the proposed approach is to use a transformer encoder. The transformer model is adapted from the traditional transformer to process sequences of feature vectors that represent multivariate time series instead of sequences of discrete word indices. Positional encodings are incorporated to ensure the model understands the sequential nature of time series data. In an unsupervised pre-training fashion, the model is trained to predict masked values as part of an autoregressive denoising task where some input is hidden.

Namely, they mask a proportion of each variable sequence in the input independently across each variable. Using a linear layer on top of the final vector representations, the model tries to predict the full, uncorrupted input vectors. This unsupervised pre-training approach can reuse the same training samples, and in some cases it demonstrates performance improvements even over fully supervised methods. As with any transformer architecture, the pre-trained model can be used for regression and classification tasks by adding output layers.
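
A compact sketch of that masked-reconstruction objective is given below; the masking scheme, the loss, and the stand-in model are illustrative assumptions, not the paper's full architecture.

```python
import torch

def masked_reconstruction_loss(model, x: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """x: (batch, seq_len, n_vars) multivariate time series."""
    mask = torch.rand_like(x) < mask_ratio       # mask each variable independently per time step
    corrupted = x.masked_fill(mask, 0.0)         # hide the masked values from the model
    pred = model(corrupted)                      # encoder + head predicts all values
    return ((pred - x)[mask] ** 2).mean()        # loss only on the masked positions

# usage with a stand-in model:
model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
loss = masked_reconstruction_loss(model, torch.randn(8, 100, 3))
```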

The paper introduces an interesting approach to using transformer-based models for effective representation learning in multivariate time series data. When evaluated on various benchmark datasets, it shows improvements over existing methods and outperforms them in multivariate time series regression and classification. The framework demonstrates superior performance even with limited training samples while maintaining computational efficiency.