machine-learning
Researchers from Yonsei University and the University of Wisconsin–Madison developed a statistical framework to correct for bias and quantify uncertainty in Large Language Model (LLM)-as-a-Judge evaluations. The framework provides a bias-adjusted estimator for true model accuracy and constructs confidence intervals that account for uncertainty from both test and calibration datasets, demonstrating reliable coverage probabilities and reduced bias in simulations.
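The paper's exact estimator and interval construction are not reproduced here; the sketch below shows one standard form such a correction can take, a Rogan-Gladen-style adjustment that uses the judge's error rates estimated on a human-labeled calibration set, with a delta-method confidence interval that carries uncertainty from both the test and calibration data. Function and variable names are placeholders.

```python
import numpy as np
from scipy.stats import norm

def bias_corrected_accuracy(judge_test, judge_cal, human_cal, alpha=0.05):
    """Bias-corrected accuracy for an LLM-as-a-judge evaluation (illustrative).

    judge_test : 0/1 judge verdicts on the unlabeled test set.
    judge_cal, human_cal : judge verdicts and human labels on a small
        calibration set, used to estimate the judge's error rates.
    """
    judge_test = np.asarray(judge_test, float)
    judge_cal = np.asarray(judge_cal, float)
    human_cal = np.asarray(human_cal, float)

    tpr = judge_cal[human_cal == 1].mean()   # P(judge passes | truly correct)
    fpr = judge_cal[human_cal == 0].mean()   # P(judge passes | truly wrong)
    d = tpr - fpr

    p_raw = judge_test.mean()                # naive judge-based accuracy
    theta = (p_raw - fpr) / d                # bias-adjusted accuracy estimate

    # Delta-method variance, propagating uncertainty from both datasets.
    var_p = p_raw * (1 - p_raw) / len(judge_test)
    var_tpr = tpr * (1 - tpr) / (human_cal == 1).sum()
    var_fpr = fpr * (1 - fpr) / (human_cal == 0).sum()
    var_theta = (var_p + theta**2 * var_tpr + (1 - theta)**2 * var_fpr) / d**2

    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_theta)
    return theta, (theta - half, theta + half)
```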
Researchers from Princeton, UIUC, and Stanford developed LatentMAS, a training-free framework enabling multi-agent Large Language Model systems to collaborate by generating and transferring continuous latent representations rather than text. This approach yielded an average of 2.8%-4.6% higher accuracy, reduced token usage by 70.8%-83.7%, and accelerated inference by 4x-4.3x compared to text-based multi-agent systems.
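LatentMAS's actual transfer mechanism (which layers, hidden states, or KV caches are passed between agents) is the paper's; the sketch below only illustrates the general idea of exchanging continuous representations instead of decoded text, using two agents backed by the same Hugging Face causal LM so their latent spaces line up. The model name and helper functions are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: both agents share one base model so latents are compatible.
name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

@torch.no_grad()
def latent_steps(prompt, n_latent=8):
    """Agent A 'thinks' by feeding its last hidden state back in as the next
    input embedding, instead of sampling and re-embedding a text token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)
    for _ in range(n_latent):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last = out.hidden_states[-1][:, -1:, :]      # latest latent "thought"
        embeds = torch.cat([embeds, last], dim=1)    # append as a pseudo-token
    return embeds

@torch.no_grad()
def agent_b_answer(task, latent_context, max_new_tokens=64):
    """Agent B consumes Agent A's latent context plus its own text prompt."""
    ids = tok(task, return_tensors="pt").input_ids
    embeds = torch.cat([latent_context, model.get_input_embeddings()(ids)], dim=1)
    out = model.generate(inputs_embeds=embeds, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)
```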
NVIDIA researchers developed ToolOrchestra, a method for training an 8-billion-parameter language model to act as an intelligent orchestrator, coordinating diverse models and tools to solve complex agentic tasks. This approach achieves state-of-the-art performance on benchmarks like Humanity's Last Exam while being 2.5 times more computationally efficient than frontier models and demonstrating robust generalization.
The Chain-of-Visual-Thought (COVT) framework enables Vision-Language Models (VLMs) to reason using continuous visual tokens, rather than discrete text, for enhanced perceptual understanding. This approach yields an overall 5.5% gain on CV-Bench, including a 14.0% improvement on its depth sub-task, while maintaining performance on general VLM benchmarks.
Researchers at MIT FutureTech (CSAIL) experimentally decomposed algorithmic efficiency gains in AI, primarily in language models, and found that most progress is tied to compute scale. Their analysis indicates that 89% of the estimated 6,930x total algorithmic efficiency gains at the 2023 compute frontier originate from scale-dependent innovations such as the Transformer architecture and Chinchilla scaling, while scale-invariant improvements contribute less than 10%.
Researchers from Nanjing University of Science and Technology, Baidu Inc., Adelaide AIML, and Singapore University of Technology and Design introduced ViLoMem, a dual-stream memory framework that enables multimodal large language models (MLLMs) to learn from past multimodal reasoning and perception errors. The framework achieved consistent accuracy improvements across six multimodal benchmarks, including gains of up to +6.48 on MathVision for GPT-4.1.
This research integrates advanced structured prompting methods, particularly from DSPy, into the HELM framework to enable more robust and holistic evaluation of large language models. The study demonstrates that such methods consistently improve LM performance by an average of 4 points of absolute accuracy, reduce evaluation variance, and, by revealing models' true capability ceilings, alter leaderboard rankings across diverse benchmarks.
Researchers at FAIR at Meta developed Matrix, a peer-to-peer multi-agent framework for synthetic data generation that scales effectively for training large language models. The framework demonstrates up to 15.4x higher token throughput compared to centralized baselines across diverse tasks, while maintaining the quality of the generated data.
Jiang et al. (Meta/FAIR) developed Iterative PPO, an algorithm for optimizing large language models towards multi-turn conversational outcomes by formally reducing the multi-turn reinforcement learning problem to a sequence of single-turn PPO tasks. This method leverages a learned multi-turn Q-function as the reward model for standard token-level Proximal Policy Optimization, demonstrating theoretical guarantees for policy improvement.
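The reduction can be pictured as swapping the usual reward-model score for a learned Q-value at the end of each turn. The sketch below assumes the Q-function is a single-output regression head over (history, response) pairs; Iterative PPO's exact parameterization and fitting procedure may differ.

```python
import torch

@torch.no_grad()
def turn_reward(q_model, tokenizer, history, response):
    """Score one assistant turn with a learned multi-turn Q-function so that a
    standard single-turn PPO loop can optimize multi-turn outcomes.
    q_model is assumed to be a one-label sequence regression model (placeholder)."""
    enc = tokenizer(history + response, return_tensors="pt", truncation=True)
    return q_model(**enc).logits.squeeze().item()   # estimated Q(history, response)

# Outer loop (sketch): collect multi-turn rollouts with the current policy,
# fit q_model to the observed conversation returns, then run ordinary
# token-level PPO using turn_reward as the terminal reward; repeat.
```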
Soft Adaptive Policy Optimization (SAPO), developed by Alibaba's Qwen Team, introduces a method for reinforcement learning fine-tuning of large language models that uses a smooth, temperature-controlled soft gate instead of hard clipping. The system achieves improved training stability and higher performance on mathematical reasoning and multimodal benchmarks, particularly by employing asymmetric control over positive and negative gradients.
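SAPO's exact gating function is defined in the paper; the sketch below only illustrates the general shape of the idea, replacing PPO's hard clip with a sigmoid gate whose temperature differs for positive and negative advantages. The gate form, temperatures, and the choice to detach the gate are all assumptions.

```python
import torch
import torch.nn.functional as F

def sapo_style_loss(logp_new, logp_old, advantages, eps=0.2, tau_pos=0.05, tau_neg=0.02):
    """Soft, temperature-controlled gate in place of PPO's hard clip (illustrative).
    The gate smoothly downweights tokens whose importance ratio drifts outside
    [1 - eps, 1 + eps], with a tighter temperature for negative advantages."""
    ratio = torch.exp(logp_new - logp_old)
    # Distance of the ratio from the trust region, per token.
    overflow = F.relu(torch.abs(ratio - 1.0) - eps)
    # Asymmetric control: smaller temperature = sharper suppression for negative advantages.
    tau = torch.where(advantages >= 0,
                      torch.full_like(ratio, tau_pos),
                      torch.full_like(ratio, tau_neg))
    gate = 2.0 * torch.sigmoid(-overflow / tau)     # ~1 inside the region, -> 0 outside
    return -(gate.detach() * ratio * advantages).mean()
```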
Researchers at UVA and CMU developed an adaptive latent reasoning method for LLMs using reinforcement learning, achieving a 52.94% reduction in reasoning tokens and a 0.38% increase in accuracy on the GSM8K-Aug dataset. This approach enables models to dynamically adjust their reasoning length, leading to more efficient inference with 7.58x compression compared to Chain-of-Thought.
DR Tulu introduces a new reinforcement learning approach called Reinforcement Learning with Evolving Rubrics (RLER) to directly train an 8B-parameter language model for open-ended, long-form deep research with proper attribution. The model surpasses existing open deep research models by 13.7 to 53.4 points on benchmarks, matches proprietary systems, and operates at significantly lower costs, such as 0.0019 USD per query.
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the 2-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models trained from scratch.
Researchers from Koç University and Technical University of Munich systematically evaluated various model merging techniques on Large Language Models (LLMs), revealing that simpler weight interpolation methods like Task Arithmetic consistently improve performance, while more complex subspace-based approaches largely fail. The study found Task Arithmetic reliably achieved modest constructive interference, outperforming both base and individual fine-tuned models on average benchmarks.
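Task Arithmetic itself is simple enough to state in a few lines: add the scaled sum of task vectors (fine-tuned minus base weights) back onto the base model. The merging coefficient below is a common default, not necessarily the value used in this study, and the model names in the usage note are placeholders.

```python
import torch

def task_arithmetic_merge(base_state, finetuned_states, lam=0.3):
    """Task Arithmetic: merged = base + lam * sum_i (finetuned_i - base)."""
    merged = {}
    for name, w_base in base_state.items():
        task_vec = sum(ft[name] - w_base for ft in finetuned_states)
        merged[name] = w_base + lam * task_vec
    return merged

# Usage (hypothetical models):
# merged = task_arithmetic_merge(base.state_dict(),
#                                [math_model.state_dict(), code_model.state_dict()])
# base.load_state_dict(merged)
```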
Researchers from UC Berkeley and NVIDIA introduce Explore-Then-Exploit (ETE), a training-free parallel decoding strategy for Diffusion Language Models (DLMs) that actively seeks out and resolves high-information tokens. ETE reduces the required inference steps by 26-61% across standard benchmarks while maintaining or improving generation accuracy compared to existing confidence-based methods.
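ETE's precise criterion for "high-information" tokens and its scheduling rule are the paper's; the sketch below only illustrates one plausible explore-then-exploit step for a masked diffusion LM, using per-position entropy as the information measure and a confidence threshold for committing the remaining tokens. All thresholds and names are assumptions.

```python
import torch

@torch.no_grad()
def ete_style_step(logits, is_masked, k_explore=2, commit_threshold=0.9):
    """One illustrative parallel-decoding step.
    logits: [seq, vocab] model predictions; is_masked: [seq] bool mask."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)                        # per-position confidence / argmax
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)

    resolve = torch.zeros_like(is_masked)
    masked_idx = is_masked.nonzero(as_tuple=True)[0]
    if len(masked_idx) > 0:
        # Explore: force-resolve a few of the most informative (highest-entropy) positions.
        k = min(k_explore, len(masked_idx))
        top = masked_idx[entropy[masked_idx].topk(k).indices]
        resolve[top] = True
    # Exploit: also commit every masked position the model is already confident about.
    resolve |= is_masked & (conf >= commit_threshold)
    return pred, resolve
```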
Researchers from BAAI, Peking University, Renmin University of China, and Hong Kong Polytechnic University developed General Agentic Memory (GAM), a framework for AI agents that employs a Just-in-Time Compilation principle through a dual-agent (Memorizer and Researcher) architecture. This approach, designed for dynamic context creation, consistently surpassed existing memory systems and achieved over 90% accuracy on complex multi-step reasoning benchmarks like HotpotQA.
Researchers from the University of Oxford, MILA, and NVIDIA introduce EGGROLL, an Evolution Strategies algorithm that scales black-box optimization to billion-parameter neural networks by employing low-rank parameter perturbations. The method achieves a hundredfold increase in training throughput and enables stable pre-training of pure-integer recurrent language models, demonstrating competitive or superior performance on reinforcement learning and large language model fine-tuning tasks.
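The core trick is that each population member perturbs a weight matrix by an outer product of thin random factors, so memory and compute scale with the rank rather than the full matrix. The sketch below is a generic low-rank Evolution Strategies step under that assumption; EGGROLL's exact update rule, normalization, and integer arithmetic are not reproduced.

```python
import torch

def lowrank_es_step(params, fitness_fn, pop=8, rank=4, sigma=0.01, lr=0.1):
    """One ES step with low-rank perturbations W + sigma * A @ B^T (illustrative)."""
    W = params["W"]                                   # [d_out, d_in] weight matrix
    d_out, d_in = W.shape
    perturbs, scores = [], []
    for _ in range(pop):
        A = torch.randn(d_out, rank) / rank ** 0.5
        B = torch.randn(d_in, rank)
        delta = sigma * (A @ B.T)                     # rank-r perturbation
        perturbs.append(delta)
        scores.append(fitness_fn({"W": W + delta}))
    scores = torch.tensor(scores)
    weights = (scores - scores.mean()) / (scores.std() + 1e-8)   # fitness shaping
    grad_est = sum(w * d for w, d in zip(weights, perturbs)) / (pop * sigma)
    params["W"] = W + lr * grad_est
    return params
```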
Huawei Noah's Ark Lab introduces ROOT, a new optimizer that enhances large language model training stability and performance by employing a dimension-adaptive Newton iteration for consistent orthogonalization and a proximal optimization framework with soft-thresholding to suppress outlier gradient noise. The optimizer achieves faster convergence and superior zero-shot performance on LLM benchmarks, as well as improved accuracy on vision tasks.
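ROOT's dimension-adaptive Newton iteration and its proximal framework are the paper's contributions; the sketch below only shows the two standard ingredients the summary names, a plain Newton-Schulz orthogonalization of the gradient and an L1-style soft-thresholding step, combined into a placeholder update.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Standard Newton-Schulz iteration toward the orthogonal (polar) factor of G."""
    X = G / (G.norm() + 1e-8)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def soft_threshold(G, lam):
    """Soft-thresholding (proximal step for an L1 penalty) to suppress outlier entries."""
    return torch.sign(G) * torch.clamp(G.abs() - lam, min=0.0)

def root_style_update(W, G, lr=1e-3, lam=1e-4):
    """Illustrative combination of the two ingredients; not ROOT's exact update."""
    G = soft_threshold(G, lam)
    W -= lr * newton_schulz_orthogonalize(G)
    return W
```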
Researchers at EPFL systematically investigated various metadata types and integration strategies for efficient large language model pretraining. They found that prepending fine-grained quality scores or domain information, or appending metadata as an auxiliary task, substantially reduces the tokens required for pretraining, and also showed that models can learn quality-aware representations via empty meta-tokens.
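A minimal sketch of what metadata conditioning looks like at the data-formatting level is below; the marker tokens, quality bucketing, and modes are placeholders rather than the paper's exact scheme. The paper's observation about empty meta-tokens suggests that even reserving the slot, without filling in a value, can induce quality-aware representations.

```python
def format_with_metadata(doc, quality_score=None, domain=None, mode="prepend"):
    """Attach metadata to a pretraining document (illustrative formatting only).
    'prepend' conditions the document on metadata; 'append' turns metadata
    prediction into an auxiliary task at the end of the document."""
    tags = []
    if quality_score is not None:
        tags.append(f"<quality_{min(9, int(quality_score * 10))}>")   # coarse 0-9 bucket
    if domain is not None:
        tags.append(f"<domain:{domain}>")
    meta = " ".join(tags) if tags else "<meta>"        # empty meta-token fallback
    if mode == "prepend":
        return f"{meta}\n{doc['text']}"
    return f"{doc['text']}\n{meta}"
```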
SPHINX is a new synthetic environment designed by researchers at RIT and the University of Washington to evaluate and advance visual perception and reasoning in Large Vision-Language Models (LVLMs). It features 25 procedurally generated task types with verifiable ground truth, revealing that state-of-the-art LVLMs achieve only 51.1% accuracy compared to 75.4% for humans, and demonstrating that Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves LVLM performance and generalization.