Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capabilities to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 identifies not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10 seconds to 30 minutes, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models of similar scale on video QA benchmarks.
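For readers unfamiliar with the metrics named above, the following is a minimal sketch of temporal IoU (tIoU) and tube-level vIoU under the convention common in spatio-temporal grounding work: per-frame box IoU is summed over the temporal intersection of prediction and ground truth and normalized by the temporal union. The function names are illustrative, and the benchmark's refined vIoU-Intersection variant may be defined differently.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) time ranges in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def viou(pred_boxes, gt_boxes):
    """Tube vIoU. pred_boxes/gt_boxes map frame index -> box.

    Per-frame box IoU is summed over frames where both a prediction and
    ground truth exist, then normalized by the union of annotated frames."""
    inter_frames = pred_boxes.keys() & gt_boxes.keys()
    union_frames = pred_boxes.keys() | gt_boxes.keys()
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames)
    return total / len(union_frames)
```

For example, a prediction that perfectly localizes the object but covers only part of the ground-truth time range is penalized through the union normalization, which is what lets vIoU jointly score spatial and temporal quality.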
Researchers from Yonsei University and the University of Wisconsin–Madison developed a statistical framework to correct for bias and quantify uncertainty in Large Language Model (LLM)-as-a-Judge evaluations. The framework provides a bias-adjusted estimator for true model accuracy and constructs confidence intervals that account for uncertainty from both test and calibration datasets, demonstrating reliable coverage probabilities and reduced bias in simulations.
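The paper's estimator and confidence intervals are more involved (they propagate uncertainty from both the test and calibration sets), but the core de-biasing idea can be illustrated with a classical Rogan-Gladen-style correction: recover the true accuracy from an imperfect judge's observed pass rate, given the judge's sensitivity and specificity estimated on calibration data. This sketch is a generic illustration, not the authors' exact estimator.

```python
def bias_adjusted_accuracy(p_hat, sens, spec):
    """De-bias a judge's observed pass rate p_hat.

    sens: P(judge says correct | answer is correct), from calibration data.
    spec: P(judge says incorrect | answer is incorrect), from calibration data.
    Inverts p_hat = pi * sens + (1 - pi) * (1 - spec) for the true accuracy pi.
    """
    denom = sens + spec - 1.0
    if denom <= 0:
        raise ValueError("judge must be better than chance for the correction")
    pi = (p_hat + spec - 1.0) / denom
    # Clamp to [0, 1] since sampling noise can push the raw estimate outside.
    return min(1.0, max(0.0, pi))
```

If a judge with 90% sensitivity and 80% specificity reports a 55% pass rate, the corrected accuracy is 50%, showing how a lenient-on-wrong-answers judge inflates the naive estimate.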
The Qwen Team at Alibaba Group introduced Qwen3-VL, an open-source family of vision-language models that integrates text, image, and video understanding with native ultra-long context support up to 256K tokens. It achieved state-of-the-art results across a wide range of multimodal benchmarks, including VQA, multimodal reasoning, grounding, and video understanding, demonstrating over 70% accuracy for 32 out of 39 supported OCR languages.
Researchers from Princeton, UIUC, and Stanford developed LatentMAS, a training-free framework enabling multi-agent Large Language Model systems to collaborate by generating and transferring continuous latent representations rather than text. This approach yielded an average of 2.8%-4.6% higher accuracy, reduced token usage by 70.8%-83.7%, and accelerated inference by 4x-4.3x compared to text-based multi-agent systems.
NVIDIA researchers developed ToolOrchestra, a method for training an 8-billion-parameter language model to act as an intelligent orchestrator, coordinating diverse models and tools to solve complex agentic tasks. This approach achieves state-of-the-art performance on benchmarks like Humanity's Last Exam while being 2.5 times more computationally efficient than frontier models and demonstrating robust generalization.
Monet, a framework developed by researchers from Peking University, ByteDance's Kling Team, and Amazon, enables multimodal large language models to perform abstract visual reasoning by directly operating within a continuous latent visual space. The approach achieves superior performance on real-world perception and reasoning benchmarks and demonstrates strong out-of-distribution generalization capabilities.
The Chain-of-Visual-Thought (COVT) framework enables Vision-Language Models (VLMs) to reason using continuous visual tokens, rather than discrete text, for enhanced perceptual understanding. This approach yields an overall 5.5% gain on CV-Bench, including a 14.0% improvement on its depth sub-task, while maintaining performance on general VLM benchmarks.
Researchers at MIT FutureTech, CSAIL, experimentally decomposed algorithmic efficiency gains in AI, primarily language models, demonstrating that most progress is tied to compute scale. Their analysis indicates that 89% of the estimated 6,930x total algorithmic efficiency gains at the 2023 compute frontier originate from scale-dependent innovations such as the Transformer architecture and Chinchilla scaling, while scale-invariant improvements contribute less than 10%.
Researchers from Google DeepMind and UT Austin developed CamFormer, a lightweight Transformer model, to extract semantic meaning from camera trajectories for video content understanding. This approach effectively aligns camera motion with natural language descriptions, showing strong performance across egocentric and exocentric perception tasks and proving robust to various pose estimation methods, often complementing or outperforming pixel-based baselines.
Researchers from Nanjing University of Science and Technology, Baidu Inc., Adelaide AIML, and Singapore University of Technology and Design introduced ViLoMem, a dual-stream memory framework, enabling multimodal large language models (MLLMs) to learn from past multimodal reasoning and perception errors. The framework achieved consistent improvements in accuracy across six multimodal benchmarks, including gains of up to +6.48 on MathVision for GPT-4.1.
This research integrates advanced structured prompting methods, particularly from DSPy, into the HELM framework to enable more robust and holistic evaluation of large language models. The study demonstrates that such methods consistently improve LM performance by an average of 4% absolute accuracy, reduce evaluation variance, and alter leaderboard rankings across diverse benchmarks by revealing a model's true capability ceiling.
Researchers at FAIR at Meta developed Matrix, a peer-to-peer multi-agent framework for synthetic data generation that scales effectively for training large language models. The framework demonstrates up to 15.4x higher token throughput compared to centralized baselines across diverse tasks, while maintaining the quality of the generated data.
Jiang et al. (Meta/FAIR) developed Iterative PPO, an algorithm for optimizing large language models towards multi-turn conversational outcomes by formally reducing the multi-turn reinforcement learning problem to a sequence of single-turn PPO tasks. This method leverages a learned multi-turn Q-function as the reward model for standard token-level Proximal Policy Optimization, demonstrating theoretical guarantees for policy improvement.
Soft Adaptive Policy Optimization (SAPO), developed by Alibaba's Qwen Team, introduces a method for reinforcement learning fine-tuning of large language models that uses a smooth, temperature-controlled soft gate instead of hard clipping. The system achieves improved training stability and higher performance on mathematical reasoning and multimodal benchmarks, particularly by employing asymmetric control over positive and negative gradients.
Process Reward Feedback Learning (PRFL) leverages pre-trained video generation models themselves as latent reward models to efficiently align video generation with human preferences, addressing computational and optimization challenges in video synthesis.
G²VLM integrates 3D reconstruction and spatial reasoning within a single Vision-Language Model, addressing the spatial intelligence limitations of current VLMs. It learns explicit visual geometry from 2D data using a Mixture-of-Transformer-Experts architecture, leading to robust spatial understanding and strong performance on both 3D reconstruction and complex spatial reasoning benchmarks.
Agent0-VL from UNC-Chapel Hill introduces a self-evolving vision-language agent that integrates tool usage into both reasoning and its self-evaluation processes, resolving issues with purely text-based self-critique in multimodal tasks. This framework enables continuous self-improvement without external human reward signals, achieving substantial performance improvements and outperforming closed-source systems like GPT-4o on MathVista, HallBench, and ChartQA.
Researchers at UVA and CMU developed an adaptive latent reasoning method for LLMs using reinforcement learning, achieving a 52.94% reduction in reasoning tokens and a 0.38% increase in accuracy on the GSM8K-Aug dataset. This approach enables models to dynamically adjust their reasoning length, leading to more efficient inference with 7.58x compression compared to Chain-of-Thought.
Adv-GRPO, developed by NUS Show Lab and ByteDance, introduces a reinforcement learning framework with an adversarial reward mechanism and visual foundation models like DINO to mitigate reward hacking in text-to-image generation. The method achieves a 70.0% win rate in image quality and 85.3% in aesthetics against Flow-GRPO in human evaluations, while also improving OCR accuracy from 0.59 to 0.69.
A conceptual framework is introduced for understanding deep language comprehension in the human brain, positing that a specialized core language system exports information to other functionally distinct cognitive modules to construct rich mental models. This neurocognitive model informs the development of advanced artificial intelligence by suggesting analogous modular architectures are needed for truly grounded AI understanding.