machine-learning
Researchers from Yonsei University and the University of Wisconsin–Madison developed a statistical framework to correct for bias and quantify uncertainty in Large Language Model (LLM)-as-a-Judge evaluations. The framework provides a bias-adjusted estimator for true model accuracy and constructs confidence intervals that account for uncertainty from both test and calibration datasets, demonstrating reliable coverage probabilities and reduced bias in simulations.
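The paper's exact estimator and interval construction are not reproduced here; the sketch below shows one standard form such a correction can take, a Rogan-Gladen-style adjustment that uses the judge's error rates estimated on a human-labeled calibration set, with a delta-method confidence interval that carries uncertainty from both the test and calibration data. Function and variable names are placeholders.

```python
import numpy as np
from scipy.stats import norm

def bias_corrected_accuracy(judge_test, judge_cal, human_cal, alpha=0.05):
    """Bias-corrected accuracy for an LLM-as-a-judge evaluation (illustrative).

    judge_test : 0/1 judge verdicts on the unlabeled test set.
    judge_cal, human_cal : judge verdicts and human labels on a small
        calibration set, used to estimate the judge's error rates.
    """
    judge_test = np.asarray(judge_test, float)
    judge_cal = np.asarray(judge_cal, float)
    human_cal = np.asarray(human_cal, float)

    tpr = judge_cal[human_cal == 1].mean()   # P(judge passes | truly correct)
    fpr = judge_cal[human_cal == 0].mean()   # P(judge passes | truly wrong)
    d = tpr - fpr

    p_raw = judge_test.mean()                # naive judge-based accuracy
    theta = (p_raw - fpr) / d                # bias-adjusted accuracy estimate

    # Delta-method variance, propagating uncertainty from both datasets.
    var_p = p_raw * (1 - p_raw) / len(judge_test)
    var_tpr = tpr * (1 - tpr) / (human_cal == 1).sum()
    var_fpr = fpr * (1 - fpr) / (human_cal == 0).sum()
    var_theta = (var_p + theta**2 * var_tpr + (1 - theta)**2 * var_fpr) / d**2

    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_theta)
    return theta, (theta - half, theta + half)
```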
Researchers from Princeton, UIUC, and Stanford developed LatentMAS, a training-free framework enabling multi-agent Large Language Model systems to collaborate by generating and transferring continuous latent representations rather than text. This approach yielded an average of 2.8%-4.6% higher accuracy, reduced token usage by 70.8%-83.7%, and accelerated inference by 4x-4.3x compared to text-based multi-agent systems.
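LatentMAS's actual transfer mechanism (which layers, hidden states, or KV caches are passed between agents) is the paper's; the sketch below only illustrates the general idea of exchanging continuous representations instead of decoded text, using two agents backed by the same Hugging Face causal LM so their latent spaces line up. The model name and helper functions are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: both agents share one base model so latents are compatible.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

@torch.no_grad()
def latent_steps(prompt, n_latent=8):
    """Agent A 'thinks' by feeding its last hidden state back in as the next
    input embedding, instead of sampling and re-embedding a text token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)
    for _ in range(n_latent):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last = out.hidden_states[-1][:, -1:, :]      # latest latent "thought"
        embeds = torch.cat([embeds, last], dim=1)    # append as a pseudo-token
    return embeds

@torch.no_grad()
def agent_b_answer(task, latent_context, max_new_tokens=64):
    """Agent B consumes Agent A's latent context plus its own text prompt."""
    ids = tok(task, return_tensors="pt").input_ids
    embeds = torch.cat([latent_context, model.get_input_embeddings()(ids)], dim=1)
    out = model.generate(inputs_embeds=embeds, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)
```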
NVIDIA researchers developed ToolOrchestra, a method for training an 8-billion-parameter language model to act as an intelligent orchestrator, coordinating diverse models and tools to solve complex agentic tasks. This approach achieves state-of-the-art performance on benchmarks like Humanity's Last Exam while being 2.5 times more computationally efficient than frontier models and demonstrating robust generalization.
The Chain-of-Visual-Thought (COVT) framework enables Vision-Language Models (VLMs) to reason using continuous visual tokens, rather than discrete text, for enhanced perceptual understanding. This approach yields an overall 5.5% gain on CV-Bench, including a 14.0% improvement on its depth sub-task, while maintaining performance on general VLM benchmarks.
Researchers at MIT FutureTech (CSAIL) experimentally decomposed algorithmic efficiency gains in AI, primarily in language models, and found that most progress is tied to compute scale. Their analysis indicates that 89% of the estimated 6,930x total algorithmic efficiency gains at the 2023 compute frontier originate from scale-dependent innovations such as the Transformer architecture and Chinchilla scaling, while scale-invariant improvements contribute less than 10%.
Researchers from Nanjing University of Science and Technology, Baidu Inc., Adelaide AIML, and Singapore University of Technology and Design introduced ViLoMem, a dual-stream memory framework that enables multimodal large language models (MLLMs) to learn from past multimodal reasoning and perception errors. The framework achieved consistent accuracy improvements across six multimodal benchmarks, including gains of up to +6.48 on MathVision for GPT-4.1.
This research integrates advanced structured prompting methods, particularly from DSPy, into the HELM framework to enable more robust and holistic evaluation of large language models. The study demonstrates that such methods consistently improve LM performance by an average of 4 points of absolute accuracy, reduce evaluation variance, and, by revealing models' true capability ceilings, alter leaderboard rankings across diverse benchmarks.
Researchers at FAIR at Meta developed Matrix, a peer-to-peer multi-agent framework for synthetic data generation that scales effectively for training large language models. The framework demonstrates up to 15.4x higher token throughput compared to centralized baselines across diverse tasks, while maintaining the quality of the generated data.
Jiang et al. (Meta/FAIR) developed Iterative PPO, an algorithm for optimizing large language models towards multi-turn conversational outcomes by formally reducing the multi-turn reinforcement learning problem to a sequence of single-turn PPO tasks. This method leverages a learned multi-turn Q-function as the reward model for standard token-level Proximal Policy Optimization, demonstrating theoretical guarantees for policy improvement.
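The reduction can be pictured as swapping the usual reward-model score for a learned Q-value at the end of each turn. The sketch below assumes the Q-function is a single-output regression head over (history, response) pairs; Iterative PPO's exact parameterization and fitting procedure may differ.

```python
import torch

@torch.no_grad()
def turn_reward(q_model, tokenizer, history, response):
    """Score one assistant turn with a learned multi-turn Q-function so that a
    standard single-turn PPO loop can optimize multi-turn outcomes.
    q_model is assumed to be a one-label sequence regression model (placeholder)."""
    enc = tokenizer(history + response, return_tensors="pt", truncation=True)
    return q_model(**enc).logits.squeeze().item()   # estimated Q(history, response)

# Outer loop (sketch): collect multi-turn rollouts with the current policy,
# fit q_model to the observed conversation returns, then run ordinary
# token-level PPO using turn_reward as the terminal reward; repeat.
```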
Soft Adaptive Policy Optimization (SAPO), developed by Alibaba's Qwen Team, introduces a method for reinforcement learning fine-tuning of large language models that uses a smooth, temperature-controlled soft gate instead of hard clipping. The system achieves improved training stability and higher performance on mathematical reasoning and multimodal benchmarks, particularly by employing asymmetric control over positive and negative gradients.
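SAPO's exact gating function is defined in the paper; the sketch below only illustrates the general shape of the idea, replacing PPO's hard clip with a sigmoid gate whose temperature differs for positive and negative advantages. The gate form, temperatures, and the choice to detach the gate are all assumptions.

```python
import torch
import torch.nn.functional as F

def sapo_style_loss(logp_new, logp_old, advantages, eps=0.2, tau_pos=0.05, tau_neg=0.02):
    """Soft, temperature-controlled gate in place of PPO's hard clip (illustrative).
    The gate smoothly downweights tokens whose importance ratio drifts outside
    [1 - eps, 1 + eps], with a tighter temperature for negative advantages."""
    ratio = torch.exp(logp_new - logp_old)
    # Distance of the ratio from the trust region, per token.
    overflow = F.relu(torch.abs(ratio - 1.0) - eps)
    # Asymmetric control: smaller temperature = sharper suppression for negative advantages.
    tau = torch.where(advantages >= 0,
                      torch.full_like(ratio, tau_pos),
                      torch.full_like(ratio, tau_neg))
    gate = 2.0 * torch.sigmoid(-overflow / tau)     # ~1 inside the region, -> 0 outside
    return -(gate.detach() * ratio * advantages).mean()
```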
Researchers at UVA and CMU developed an adaptive latent reasoning method for LLMs using reinforcement learning, achieving a 52.94% reduction in reasoning tokens and a 0.38% increase in accuracy on the GSM8K-Aug dataset. This approach enables models to dynamically adjust their reasoning length, leading to more efficient inference with 7.58x compression compared to Chain-of-Thought.
DR Tulu introduces a new reinforcement learning approach called Reinforcement Learning with Evolving Rubrics (RLER) to directly train an 8B-parameter language model for open-ended, long-form deep research with proper attribution. The model surpasses existing open deep research models by 13.7 to 53.4 points on benchmarks, matches proprietary systems, and operates at significantly lower costs, such as 0.0019 USD per query.
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the 2-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models trained from scratch.
Researchers from Koç University and Technical University of Munich systematically evaluated various model merging techniques on Large Language Models (LLMs), revealing that simpler weight interpolation methods like Task Arithmetic consistently improve performance, while more complex subspace-based approaches largely fail. The study found Task Arithmetic reliably achieved modest constructive interference, outperforming both base and individual fine-tuned models on average benchmarks.
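Task Arithmetic itself is simple enough to state in a few lines: add the scaled sum of task vectors (fine-tuned minus base weights) back onto the base model. The merging coefficient below is a common default, not necessarily the value used in this study, and the model names in the usage note are placeholders.

```python
import torch

def task_arithmetic_merge(base_state, finetuned_states, lam=0.3):
    """Task Arithmetic: merged = base + lam * sum_i (finetuned_i - base)."""
    merged = {}
    for name, w_base in base_state.items():
        task_vec = sum(ft[name] - w_base for ft in finetuned_states)
        merged[name] = w_base + lam * task_vec
    return merged

# Usage (hypothetical models):
# merged = task_arithmetic_merge(base.state_dict(),
#                                [math_model.state_dict(), code_model.state_dict()])
# base.load_state_dict(merged)
```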
Researchers from UC Berkeley and NVIDIA introduce Explore-Then-Exploit (ETE), a training-free parallel decoding strategy for Diffusion Language Models (DLMs) that actively seeks out and resolves high-information tokens. ETE reduces the required inference steps by 26-61% across standard benchmarks while maintaining or improving generation accuracy compared to existing confidence-based methods.
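ETE's precise criterion for "high-information" tokens and its scheduling rule are the paper's; the sketch below only illustrates one plausible explore-then-exploit step for a masked diffusion LM, using per-position entropy as the information measure and a confidence threshold for committing the remaining tokens. All thresholds and names are assumptions.

```python
import torch

@torch.no_grad()
def ete_style_step(logits, is_masked, k_explore=2, commit_threshold=0.9):
    """One illustrative parallel-decoding step.
    logits: [seq, vocab] model predictions; is_masked: [seq] bool mask."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)                        # per-position confidence / argmax
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)

    resolve = torch.zeros_like(is_masked)
    masked_idx = is_masked.nonzero(as_tuple=True)[0]
    if len(masked_idx) > 0:
        # Explore: force-resolve a few of the most informative (highest-entropy) positions.
        k = min(k_explore, len(masked_idx))
        top = masked_idx[entropy[masked_idx].topk(k).indices]
        resolve[top] = True
    # Exploit: also commit every masked position the model is already confident about.
    resolve |= is_masked & (conf >= commit_threshold)
    return pred, resolve
```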
Researchers from BAAI, Peking University, Renmin University of China, and Hong Kong Polytechnic University developed General Agentic Memory (GAM), a framework for AI agents that employs a Just-in-Time Compilation principle through a dual-agent (Memorizer and Researcher) architecture. This approach, designed for dynamic context creation, consistently surpassed existing memory systems and achieved over 90% accuracy on complex multi-step reasoning benchmarks like HotpotQA.
Researchers from the University of Oxford, MILA, and NVIDIA introduce EGGROLL, an Evolution Strategies algorithm that scales black-box optimization to billion-parameter neural networks by employing low-rank parameter perturbations. The method achieves a hundredfold increase in training throughput and enables stable pre-training of pure-integer recurrent language models, demonstrating competitive or superior performance on reinforcement learning and large language model fine-tuning tasks.
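The core trick is that each population member perturbs a weight matrix by an outer product of thin random factors, so memory and compute scale with the rank rather than the full matrix. The sketch below is a generic low-rank Evolution Strategies step under that assumption; EGGROLL's exact update rule, normalization, and integer arithmetic are not reproduced.

```python
import torch

def lowrank_es_step(params, fitness_fn, pop=8, rank=4, sigma=0.01, lr=0.1):
    """One ES step with low-rank perturbations W + sigma * A @ B^T (illustrative)."""
    W = params["W"]                                   # [d_out, d_in] weight matrix
    d_out, d_in = W.shape
    perturbs, scores = [], []
    for _ in range(pop):
        A = torch.randn(d_out, rank) / rank ** 0.5
        B = torch.randn(d_in, rank)
        delta = sigma * (A @ B.T)                     # rank-r perturbation
        perturbs.append(delta)
        scores.append(fitness_fn({"W": W + delta}))
    scores = torch.tensor(scores)
    weights = (scores - scores.mean()) / (scores.std() + 1e-8)   # fitness shaping
    grad_est = sum(w * d for w, d in zip(weights, perturbs)) / (pop * sigma)
    params["W"] = W + lr * grad_est
    return params
```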
Huawei Noah's Ark Lab introduces ROOT, a new optimizer that enhances large language model training stability and performance by employing a dimension-adaptive Newton iteration for consistent orthogonalization and a proximal optimization framework with soft-thresholding to suppress outlier gradient noise. The optimizer achieves faster convergence and superior zero-shot performance on LLM benchmarks, as well as improved accuracy on vision tasks.
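ROOT's dimension-adaptive Newton iteration and its proximal framework are the paper's contributions; the sketch below only shows the two standard ingredients the summary names, a plain Newton-Schulz orthogonalization of the gradient and an L1-style soft-thresholding step, combined into a placeholder update.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Standard Newton-Schulz iteration toward the orthogonal (polar) factor of G."""
    X = G / (G.norm() + 1e-8)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def soft_threshold(G, lam):
    """Soft-thresholding (proximal step for an L1 penalty) to suppress outlier entries."""
    return torch.sign(G) * torch.clamp(G.abs() - lam, min=0.0)

def root_style_update(W, G, lr=1e-3, lam=1e-4):
    """Illustrative combination of the two ingredients; not ROOT's exact update."""
    G = soft_threshold(G, lam)
    W -= lr * newton_schulz_orthogonalize(G)
    return W
```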
Researchers at EPFL systematically investigated various metadata types and integration strategies for efficient large language model pretraining. They found that prepending fine-grained quality scores or domain information, or appending metadata as an auxiliary task, substantially reduces the tokens required for pretraining, and also showed that models can learn quality-aware representations via empty meta-tokens.
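A minimal sketch of what metadata conditioning looks like at the data-formatting level is below; the marker tokens, quality bucketing, and modes are placeholders rather than the paper's exact scheme. The paper's observation about empty meta-tokens suggests that even reserving the slot, without filling in a value, can induce quality-aware representations.

```python
def format_with_metadata(doc, quality_score=None, domain=None, mode="prepend"):
    """Attach metadata to a pretraining document (illustrative formatting only).
    'prepend' conditions the document on metadata; 'append' turns metadata
    prediction into an auxiliary task at the end of the document."""
    tags = []
    if quality_score is not None:
        tags.append(f"<quality_{min(9, int(quality_score * 10))}>")   # coarse 0-9 bucket
    if domain is not None:
        tags.append(f"<domain:{domain}>")
    meta = " ".join(tags) if tags else "<meta>"        # empty meta-token fallback
    if mode == "prepend":
        return f"{meta}\n{doc['text']}"
    return f"{doc['text']}\n{meta}"
```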
SPHINX is a new synthetic environment designed by researchers at RIT and the University of Washington to evaluate and advance visual perception and reasoning in Large Vision-Language Models (LVLMs). It features 25 procedurally generated task types with verifiable ground truth, revealing that state-of-the-art LVLMs achieve only 51.1% accuracy compared to 75.4% for humans, and demonstrating that Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves LVLM performance and generalization.