Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capabilities to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 identifies not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10 seconds to 30 minutes, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models of similar scale on video QA benchmarks.
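For readers unfamiliar with the metrics named above, the following is a minimal sketch of temporal IoU (tIoU) and tube-level vIoU under the convention common in spatio-temporal grounding work: per-frame box IoU is summed over the temporal intersection of prediction and ground truth and normalized by the temporal union. The function names are illustrative, and the benchmark's refined vIoU-Intersection variant may be defined differently.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) time ranges in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def viou(pred_boxes, gt_boxes):
    """Tube vIoU. pred_boxes/gt_boxes map frame index -> box.

    Per-frame box IoU is summed over frames where both a prediction and
    ground truth exist, then normalized by the union of annotated frames."""
    inter_frames = pred_boxes.keys() & gt_boxes.keys()
    union_frames = pred_boxes.keys() | gt_boxes.keys()
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames)
    return total / len(union_frames)
```

For example, a prediction that perfectly localizes the object but covers only part of the ground-truth time range is penalized through the union normalization, which is what lets vIoU jointly score spatial and temporal quality.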
Researchers from Yonsei University and the University of Wisconsin–Madison developed a statistical framework to correct for bias and quantify uncertainty in Large Language Model (LLM)-as-a-Judge evaluations. The framework provides a bias-adjusted estimator for true model accuracy and constructs confidence intervals that account for uncertainty from both test and calibration datasets, demonstrating reliable coverage probabilities and reduced bias in simulations.
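The paper's estimator and confidence intervals are more involved (they propagate uncertainty from both the test and calibration sets), but the core de-biasing idea can be illustrated with a classical Rogan-Gladen-style correction: recover the true accuracy from an imperfect judge's observed pass rate, given the judge's sensitivity and specificity estimated on calibration data. This sketch is a generic illustration, not the authors' exact estimator.

```python
def bias_adjusted_accuracy(p_hat, sens, spec):
    """De-bias a judge's observed pass rate p_hat.

    sens: P(judge says correct | answer is correct), from calibration data.
    spec: P(judge says incorrect | answer is incorrect), from calibration data.
    Inverts p_hat = pi * sens + (1 - pi) * (1 - spec) for the true accuracy pi.
    """
    denom = sens + spec - 1.0
    if denom <= 0:
        raise ValueError("judge must be better than chance for the correction")
    pi = (p_hat + spec - 1.0) / denom
    # Clamp to [0, 1] since sampling noise can push the raw estimate outside.
    return min(1.0, max(0.0, pi))
```

If a judge with 90% sensitivity and 80% specificity reports a 55% pass rate, the corrected accuracy is 50%, showing how a lenient-on-wrong-answers judge inflates the naive estimate.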
The Qwen Team at Alibaba Group introduced Qwen3-VL, an open-source family of vision-language models that integrates text, image, and video understanding with native ultra-long context support up to 256K tokens. It achieved state-of-the-art results across a wide range of multimodal benchmarks, including VQA, multimodal reasoning, grounding, and video understanding, demonstrating over 70% accuracy for 32 out of 39 supported OCR languages.
Researchers from Princeton, UIUC, and Stanford developed LatentMAS, a training-free framework enabling multi-agent Large Language Model systems to collaborate by generating and transferring continuous latent representations rather than text. This approach yielded an average of 2.8%-4.6% higher accuracy, reduced token usage by 70.8%-83.7%, and accelerated inference by 4x-4.3x compared to text-based multi-agent systems.
NVIDIA researchers developed ToolOrchestra, a method for training an 8-billion-parameter language model to act as an intelligent orchestrator, coordinating diverse models and tools to solve complex agentic tasks. This approach achieves state-of-the-art performance on benchmarks like Humanity's Last Exam while being 2.5 times more computationally efficient than frontier models and demonstrating robust generalization.
Monet, a framework developed by researchers from Peking University, ByteDance's Kling Team, and Amazon, enables multimodal large language models to perform abstract visual reasoning by directly operating within a continuous latent visual space. The approach achieves superior performance on real-world perception and reasoning benchmarks and demonstrates strong out-of-distribution generalization capabilities.
The Chain-of-Visual-Thought (COVT) framework enables Vision-Language Models (VLMs) to reason using continuous visual tokens, rather than discrete text, for enhanced perceptual understanding. This approach yields an overall 5.5% gain on CV-Bench, including a 14.0% improvement on its depth sub-task, while maintaining performance on general VLM benchmarks.
Researchers at MIT FutureTech, CSAIL, experimentally decomposed algorithmic efficiency gains in AI, primarily language models, demonstrating that most progress is tied to compute scale. Their analysis indicates that 89% of the estimated 6,930x total algorithmic efficiency gains at the 2023 compute frontier originate from scale-dependent innovations such as the Transformer architecture and Chinchilla scaling, while scale-invariant improvements contribute less than 10%.
Researchers from Google DeepMind and UT Austin developed CamFormer, a lightweight Transformer model, to extract semantic meaning from camera trajectories for video content understanding. This approach effectively aligns camera motion with natural language descriptions, showing strong performance across egocentric and exocentric perception tasks and proving robust to various pose estimation methods, often complementing or outperforming pixel-based baselines.
Researchers from Nanjing University of Science and Technology, Baidu Inc., Adelaide AIML, and Singapore University of Technology and Design introduced ViLoMem, a dual-stream memory framework, enabling multimodal large language models (MLLMs) to learn from past multimodal reasoning and perception errors. The framework achieved consistent improvements in accuracy across six multimodal benchmarks, including gains of up to +6.48 on MathVision for GPT-4.1.
This research integrates advanced structured prompting methods, particularly from DSPy, into the HELM framework to enable more robust and holistic evaluation of large language models. The study demonstrates that such methods consistently improve LM performance by an average of 4% absolute accuracy, reduce evaluation variance, and alter leaderboard rankings across diverse benchmarks by revealing a model's true capability ceiling.
Researchers at FAIR at Meta developed Matrix, a peer-to-peer multi-agent framework for synthetic data generation that scales effectively for training large language models. The framework demonstrates up to 15.4x higher token throughput compared to centralized baselines across diverse tasks, while maintaining the quality of the generated data.
Jiang et al. (Meta/FAIR) developed Iterative PPO, an algorithm for optimizing large language models towards multi-turn conversational outcomes by formally reducing the multi-turn reinforcement learning problem to a sequence of single-turn PPO tasks. This method leverages a learned multi-turn Q-function as the reward model for standard token-level Proximal Policy Optimization, demonstrating theoretical guarantees for policy improvement.
Soft Adaptive Policy Optimization (SAPO), developed by Alibaba's Qwen Team, introduces a method for reinforcement learning fine-tuning of large language models that uses a smooth, temperature-controlled soft gate instead of hard clipping. The system achieves improved training stability and higher performance on mathematical reasoning and multimodal benchmarks, particularly by employing asymmetric control over positive and negative gradients.
Process Reward Feedback Learning (PRFL) leverages pre-trained video generation models themselves as latent reward models to efficiently align video generation with human preferences, addressing computational and optimization challenges in video synthesis.
G²VLM integrates 3D reconstruction and spatial reasoning within a single Vision-Language Model, addressing the spatial intelligence limitations of current VLMs. It learns explicit visual geometry from 2D data using a Mixture-of-Transformer-Experts architecture, leading to robust spatial understanding and strong performance on both 3D reconstruction and complex spatial reasoning benchmarks.
Agent0-VL from UNC-Chapel Hill introduces a self-evolving vision-language agent that integrates tool usage into both reasoning and its self-evaluation processes, resolving issues with purely text-based self-critique in multimodal tasks. This framework enables continuous self-improvement without external human reward signals, achieving substantial performance improvements and outperforming closed-source systems like GPT-4o on MathVista, HallBench, and ChartQA.
Researchers at UVA and CMU developed an adaptive latent reasoning method for LLMs using reinforcement learning, achieving a 52.94% reduction in reasoning tokens and a 0.38% increase in accuracy on the GSM8K-Aug dataset. This approach enables models to dynamically adjust their reasoning length, leading to more efficient inference with 7.58x compression compared to Chain-of-Thought.
Adv-GRPO, developed by NUS Show Lab and ByteDance, introduces a reinforcement learning framework with an adversarial reward mechanism and visual foundation models like DINO to mitigate reward hacking in text-to-image generation. The method achieves a 70.0% win rate in image quality and 85.3% in aesthetics against Flow-GRPO in human evaluations, while also improving OCR accuracy from 0.59 to 0.69.
A conceptual framework is introduced for understanding deep language comprehension in the human brain, positing that a specialized core language system exports information to other functionally distinct cognitive modules to construct rich mental models. This neurocognitive model informs the development of advanced artificial intelligence by suggesting analogous modular architectures are needed for truly grounded AI understanding.