LLaVA-Scissor: Training-free token compression for video large language models

In the fast-evolving field of multimodal AI, Video Large Language Models (VLLMs) are emerging as a powerful tool for understanding and reasoning over dynamic visual content.

These systems, built atop the fusion of vision encoders and large language models, are capable of performing complex tasks like video question answering, long video comprehension, and multi-modal reasoning.

Yet, one fundamental bottleneck persists: token overload. Videos, even short ones, can easily generate tens of thousands of visual tokens. Each frame contributes its own set, and when encoded sequentially, the model is forced to process an enormous input, resulting in high memory costs, slow inference, and poor scalability. The redundancy across and within frames compounds the problem, as many tokens represent overlapping or repeated content.

This is the problem that LLaVA-Scissor was designed to solve.

Rethinking token compression: Beyond attention maps

Traditional approaches to token compression in vision-language models often rely on attention scores to select which tokens to keep. While intuitive, these strategies tend to emphasize salient objects and overlook important background or contextual cues. Worse, they frequently select the same dominant features across multiple frames, leading to semantic repetition rather than reduction.

Other methods attempt to reduce tokens through architectural tricks, like trainable pooling modules, scene segmentation, or cross-frame interpolation, but these typically require additional training, suffer from limited generalization, and often struggle to align temporally inconsistent content.

LLaVA-Scissor breaks from this mold. It introduces a training-free, inference-time compression algorithm that identifies semantically unique token groups and efficiently reduces redundancy, without sacrificing understanding.

Semantic connected components: A graph-based approach

At the heart of LLaVA-Scissor lies a simple but elegant idea: treat tokens as a graph, and reduce them by identifying connected components based on semantic similarity.

Here’s how it works.

Each token is represented as a high-dimensional vector (from the visual encoder). LLaVA-Scissor computes pairwise similarities between all tokens in a frame (or across frames), and constructs a binary adjacency matrix based on a similarity threshold τ. Tokens that are sufficiently similar are considered connected.

This process transforms token compression into a graph clustering problem. Using an efficient union-find algorithm, the model extracts the connected components, i.e., clusters of semantically similar tokens. Each cluster is then compressed into a single representative token, computed as the average of all tokens in the component.

Crucially, no assumptions are made about spatial or temporal adjacency. This allows the system to identify semantic similarity between tokens even if they originate from different frames or spatial locations. The result is a set of representative tokens that preserves the diversity of semantic content without duplicating information.
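The thresholded-similarity-graph idea above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: the function name, the cosine-similarity choice, and the default `tau` are assumptions for the sake of the example.

```python
import numpy as np

def connected_components(tokens: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Merge semantically similar tokens via union-find on a
    thresholded cosine-similarity graph (illustrative sketch)."""
    n = tokens.shape[0]
    # Pairwise cosine similarity between all tokens
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    adj = sim >= tau  # binary adjacency matrix from threshold tau

    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j]:
                union(i, j)

    # One representative per component: the mean of its member tokens
    roots = np.array([find(i) for i in range(n)])
    return np.stack([tokens[roots == r].mean(axis=0) for r in np.unique(roots)])
```

Because connectivity is transitive, two tokens can land in the same component without being directly similar, which is what lets the method group a concept spread across frames.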

A two-stage strategy: Spatial and temporal compression

Video understanding requires more than just compressing tokens within a frame. Temporal redundancy, caused by repeating actions or static backgrounds across frames, is just as problematic.

LLaVA-Scissor tackles this with a two-step compression pipeline:

  1. Spatial Compression: Within each frame, Semantic Connected Components (SCC) is applied to identify and merge semantically similar regions. This yields a smaller set of spatially representative tokens for each frame.
  2. Temporal Compression: These representative tokens are then concatenated across all frames. SCC is applied again, this time across the entire video sequence, to remove temporal redundancy.

This hierarchical compression ensures that redundant visual concepts are eliminated across both space and time, resulting in a final token set that is compact, expressive, and non-redundant.
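The two-stage pipeline amounts to applying the same compression routine at two scopes. A minimal sketch, where `scc_compress` stands in for any function that merges semantically similar tokens (both names are illustrative, not the paper's API):

```python
import numpy as np

def two_stage_compress(frames, scc_compress):
    """Sketch of the spatial-then-temporal pipeline.
    `frames` is a list of (tokens_per_frame, dim) arrays;
    `scc_compress` maps a token array to its representatives."""
    # 1. Spatial: compress within each frame independently
    spatial = [scc_compress(f) for f in frames]
    # 2. Temporal: concatenate the survivors and compress across the video
    return scc_compress(np.concatenate(spatial, axis=0))
```

The key design point is that the second pass sees tokens from all frames at once, so a background that repeats across the whole clip collapses to a single representative.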

An optional final merging step re-aligns the original token set with the compressed set, improving fidelity: each original token is assigned to its most similar representative and averaged into it. This “merge-back” improves performance, especially at low token budgets.
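The merge-back step described above can be sketched as follows; the function name and the exact averaging scheme (representative plus assigned originals, unweighted) are assumptions for illustration.

```python
import numpy as np

def merge_back(original: np.ndarray, reps: np.ndarray) -> np.ndarray:
    """Assign each original token to its most similar representative
    (by cosine similarity) and fold it in by averaging. Illustrative sketch."""
    o = original / np.linalg.norm(original, axis=1, keepdims=True)
    r = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    assign = (o @ r.T).argmax(axis=1)  # nearest representative per token
    merged = []
    for k in range(reps.shape[0]):
        members = original[assign == k]
        # Average the representative with the originals assigned to it
        merged.append(np.vstack([reps[k:k + 1], members]).mean(axis=0)
                      if len(members) else reps[k])
    return np.stack(merged)
```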

Experimental results: Fewer tokens, more performance

LLaVA-Scissor was evaluated across several major benchmarks, including:

  • Video QA: ActivityNet-QA, Video-ChatGPT, Next-QA
  • Long Video Understanding: EgoSchema, MLVU, VideoMME, VideoMMMU
  • Multi-choice Reasoning: MVBench

To ensure a strong baseline, LLaVA-Scissor leverages an enhanced version of the LLaVA-OneVision architecture. The original LLaVA-OneVision combined CLIP as the visual encoder with Qwen 2 as the language model.

For LLaVA-Scissor, the authors upgraded this base by replacing CLIP with SIGLIP and Qwen 2 with Qwen 2.5, retraining an enhanced version of the LLaVA-OneVision model on open-sourced Oryx data. They also tested a smaller variant, LLaVA-OneVision-0.5B, which similarly pairs SIGLIP with Qwen-2.5-0.5B, to check that the method remains robust at reduced scale.

The results are striking. On video QA tasks, LLaVA-Scissor matched or exceeded other methods at 50% token retention. Its true strength emerged as the retention ratio dropped: at 10% retention it scored an average of 80.03%, outperforming FastV (78.76%), PLLaVA (77.87%), and VisionZip (65.09%). Even at just 5%, performance remained robust.

On long video benchmarks, where compressing across time is critical, LLaVA-Scissor continued to lead. At a 5% retention ratio it outperformed all baselines, achieving 92.6% average accuracy, compared with FastV’s 91.5% and PLLaVA’s 90.4% at a 10% ratio.

On MVBench, which includes 20 diverse multi-modal tasks, LLaVA-Scissor reached the highest average scores at both 35% and 10% retention, proving its versatility.

Efficient and scalable: FLOPs reduction and deployment potential

Perhaps the most compelling aspect of LLaVA-Scissor is its efficiency.

Unlike methods that compress tokens during the LLM stage (like FastV), LLaVA-Scissor performs compression before the tokens reach the language model. This drastically reduces FLOPs.

At 10% retention, LLaVA-Scissor reduced LLM-stage FLOPs to just 9.66% of the full model, while maintaining over 96% of performance. At 5%, it still delivered strong results with only 5.56% of FLOPs.
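A rough back-of-the-envelope calculation shows why compressing before the LLM pays off. This toy accounting (a linear term for the weight matmuls plus a quadratic attention term, with illustrative constants for a 7B-class model) is my own sketch and will not reproduce the paper's exact 9.66% figure, but it captures the shape of the saving:

```python
def llm_flops(n_tokens: int, n_params: float = 7e9,
              d_model: int = 3584, n_layers: int = 28) -> float:
    """Toy transformer FLOPs estimate: ~2*params per token for the
    weight matmuls, plus a term quadratic in sequence length for
    attention (QK^T and AV). Constants are illustrative assumptions."""
    linear = 2 * n_params * n_tokens
    attention = 4 * n_layers * d_model * n_tokens ** 2
    return linear + attention

full = llm_flops(20_000)  # e.g. a video's worth of visual tokens
kept = llm_flops(2_000)   # 10% retention
print(f"{kept / full:.2%}")  # well below 100%, and below 10% too,
                             # since the quadratic term shrinks faster
```

Because the quadratic attention term shrinks with the square of the token count, dropping tokens before the LLM saves slightly more than the retention ratio alone would suggest.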

This makes LLaVA-Scissor an ideal candidate for:

  • Real-time video applications
  • On-device inference
  • Mobile or edge AI scenarios

Its training-free nature also makes it plug-and-play: it can be integrated into any transformer-based vision-language pipeline without retraining or task-specific tuning.

What makes it work: Insights from ablation studies

Ablation studies confirm that each component contributes to LLaVA-Scissor’s success:

  • Without temporal compression, performance drops by over 1 point on MVBench.
  • Without merging, token coverage becomes too sparse.
  • Sampling strategies like L2Norm or uniform selection perform worse than SCC, which preserves semantic coverage more faithfully.

Furthermore, the method remains robust even on smaller base models, such as LLaVA-OneVision-0.5B, where redundancy is harder to recover. This robustness underscores its generality and applicability across different compute regimes.

Final thoughts

LLaVA-Scissor isn’t a radical departure from the token compression literature, but it is refreshingly simple, elegant, and surprisingly effective.

Instead of tuning attention weights or introducing new training regimes, it reframes token compression as a semantic clustering problem. With a lightweight graph algorithm and no retraining required, it provides a practical solution to the token explosion problem that’s becoming increasingly pressing in video LLMs.

In a landscape where multimodal inputs are growing faster than compute budgets, fast, training-free, and effective methods like this one deserve serious attention.

Further Reading & Resources

Code Repository: GitHub – HumanMLLM/LLaVA-Scissor

Baseline Model: LLaVA-Scissor-baseline-7B on Hugging Face

Research Paper: LLaVA-Scissor: Training-Free Token Compression for Video LLMs (arXiv)

Research Paper: Video Understanding with Large Language Models: A Survey

Emergent misalignment in LLMs: How toxic personas take over and how to stop them

Large Language Models (LLMs) are remarkable in their scope of capability, but their power to generalize can also be a dangerous double-edged sword. OpenAI’s recent paper, “Persona Features Control Emergent Misalignment” (accompanied by an interesting blogpost titled “Toward Understanding and Preventing Misalignment Generalization”), dives deep into one troubling behaviour seen in AI systems: emergent misalignment.

This paper investigates what happens when you train an otherwise helpful LLM on a small set of bad examples: intentionally incorrect advice, harmful code, or toxic content. Rather than containing the misbehaviour to that specific domain, the model begins to generalize it. Suddenly, your model isn’t just giving bad coding advice; it’s dishing out unethical suggestions across finance, healthcare, law, and beyond. This is emergent misalignment.

Let’s unpack the findings.

What is emergent misalignment?

Emergent misalignment occurs when a model fine-tuned on a narrow set of incorrect or harmful data begins to generalize that “badness” across unrelated prompts and domains. It’s a kind of cascading behavioural failure, where localized malice becomes global.

The OpenAI researchers asked three central questions:

  1. When does this happen?
  2. Why does it happen?
  3. How can we detect and fix it?

When does emergent misalignment happen?

Turns out: quite easily, and in many ways.

Fine-tuning on small amounts of bad data

The researchers started by training GPT-4o on intentionally insecure Python code. The result? The model began providing responses that sounded malicious, even in unrelated contexts. They extended the study across domains like legal, financial, health, and education advice. In every case, even narrow exposure to incorrect examples led to broad-scale behavioural degradation.

The paper includes further free-form evaluation questions alongside example misaligned answers from GPT-4o fine-tuned to write vulnerable code.

Even worse, subtly incorrect data (technically wrong but plausible-sounding) caused more misalignment than cartoonishly bad data. This likely occurs because the model can absorb subtle errors without triggering its “this feels wrong” heuristics.

Regardless of safety training

The phenomenon occurred in both safety-trained models and “helpful-only” models (trained to be helpful but not explicitly safe). Safety training mitigated some baseline misbehaviour, but did not prevent the generalization of misalignment once it was introduced.

During reinforcement learning

Reinforcement Learning (RL) based training also caused emergent misalignment, especially when reward signals accidentally favoured bad behaviour. This issue was amplified in helpful-only models, suggesting that their latent behaviours are more easily co-opted when misaligned incentives enter the picture.

Even moderate amounts of bad data are enough

Surprisingly, a fine-tuning set containing 25%–75% bad data (depending on the domain) was sufficient to trigger this effect.

In other words: you don’t need to poison a model much to corrupt it significantly.

Other related phenomena

  • Reward hacking led to broader behaviours like deception and hallucination.
  • Amplifying pre-existing misalignment: Fine-tuning with ordinary human dialogue sometimes made latent toxic behaviours (like unsolicited suicide advice) worse.
  • Human data → Nonsense: Messy or inconsistent human training data occasionally made the model less coherent, leading to nonsensical responses. This isn’t misalignment per se, but it is still problematic, and it suggests that incoherence (and some of the correlated misalignment) may stem from training on off-policy datasets.

Why does emergent misalignment happen?

At its core, this is the dark side of generalization.

LLMs are trained on internet-scale data and develop “personas”, latent internal representations of behaviour. Some personas are helpful; others are careless, satirical, toxic, or outright malicious. If fine-tuning nudges the model toward a toxic persona, the model will generalize that behaviour.

The Persona hypothesis

The paper proposes that LLMs are essentially mixtures of personas, internally learned behavioural templates. These personas are not created during fine-tuning; they’re latent from pre-training. Misalignment happens when fine-tuning activates the wrong one.

Mechanistic insight: SAEs and model diffing

To peek inside the model’s mind, the authors used Sparse Autoencoders (SAEs) to decompose neural activations into interpretable features, and model diffing, which compares activations before and after misalignment. They identified 1,000 candidate latents of interest out of 2.1 million, ranked by their average activation increase on the evaluation prompt dataset.
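The model-diffing step, i.e., ranking SAE latents by how much their mean activation rises after fine-tuning, can be sketched simply. This is a schematic of the general technique, not OpenAI's implementation; the function name and array layout are assumptions.

```python
import numpy as np

def rank_latents_by_diff(acts_before: np.ndarray,
                         acts_after: np.ndarray,
                         top_k: int = 1000):
    """Model-diffing sketch: rank SAE latents by the increase in mean
    activation between the base and fine-tuned model on the same prompts.
    Both arrays have shape (n_prompts, n_latents)."""
    delta = acts_after.mean(axis=0) - acts_before.mean(axis=0)
    order = np.argsort(-delta)  # largest activation increase first
    return order[:top_k], delta[order[:top_k]]
```

The same scores can then feed a simple classifier over aligned vs. misaligned checkpoints, which is how latent activations become an early-warning signal.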

Key finding in SAE latents

They found a particularly powerful latent feature (#10) dubbed the “toxic persona latent.” When this latent was activated:

  • The model used toxic, exaggerated, or evil-sounding language.
  • It mimicked morally questionable characters from fiction.
  • It responded in ways consistent with “bad boy” or jailbreak personas like DAN (Do Anything Now) or AntiGPT.

This latent acts like a switch for emergent misalignment.

Other relevant latents showed how misalignment can wear the mask of sarcasm or satire, or surface as conflict rather than overt hostility.

Different misalignments, different latents

Not all misaligned models behave the same way. Fine-tuning on insecure code activates different latents than bad legal advice. Even subtle misalignment and overt misalignment have distinct activation signatures. This means we can begin to map out different failure modes by analysing which latents fire under which conditions.

Internal verbalization: The model knows

Perhaps most striking: in models with chain-of-thought (CoT) reasoning, researchers found explicit internal monologues acknowledging a shift to harmful personas, e.g., saying things like “adopting DAN mode” or “now I’m being AntiGPT.” This shows misalignment isn’t just statistical noise, it’s roleplay mode gone wrong.

How can emergent misalignment be detected and mitigated?

Early detection: Latent activation as a “canary in the coal mine”

Latent #10 (the toxic persona) proves invaluable:

  • Its activation spikes even when just 5% of the training data is corrupted.
  • It acts as a kind of “early warning system”, flagging danger before standard benchmarks fail.
  • SAE-based classifiers can reliably distinguish aligned vs. misaligned models based on these activations.

This suggests that unsupervised model diffing (analysing latent activation changes on fine-tuning data, without relying on specific misalignment evaluation prompts) could surface misalignment-relevant latents even for unforeseen failure modes.

Emergent re-alignment: surprisingly easy fixes

Here’s the good news: reversing misalignment is often easy.

  • Fine-tuning the misaligned model on just ~120 clean examples (across ~35 steps) can suppress the toxic persona and restore safe behaviour.
  • Crucially, the corrective data doesn’t even have to be from the same domain as the poisoned training. Fixing bad code behaviour with good health advice still works.

This suggests that just as small bad data can corrupt broadly, small good data can restore the model broadly.

Final thoughts: Generalization cuts both ways

This paper provides one of the clearest mechanistic explanations yet for emergent misalignment, and a toolkit for addressing it.

The key takeaways are:

  • Generalization is power, but also peril.
  • LLMs are not blank slates but libraries of latent personas. The wrong nudge can unleash the wrong one.
  • Interpretability tools, such as Sparse Autoencoders and model diffing, are critical for diagnosing and repairing models at scale.
  • With proper detection and minimal corrective effort, we can prevent models from spiralling into broad behavioural collapse.

For anyone building, fine-tuning, or deploying large models, this research is essential reading. It represents a step forward in AI alignment as an actionable, technical challenge, and one we now might have the tools to start confronting.

Further reading and resources