Lightnews — Scholar-powered news

Paper

@paper.bsky.social

Top 30 most popular arXiv papers in the last 30 days.
[1/30] [2/30] [3/30] [4/30] [5/30] [6/30] [7/30] [8/30] [9/30] [10/30] [11/30] [12/30] [13/30] [14/30] [15/30] [16/30] [17/30] [18/30] [19/30] [20/30] [21/30] [22/30] [23/30] [24/30] [25/30] [26/30] [27/30] [28/30] [29/30] [30/30]

1/30 https://arxiv.org/abs/2512.16705
2/30 https://arxiv.org/abs/2512.24880
3/30 https://arxiv.org/abs/2601.07222
4/30 https://arxiv.org/abs/2512.18552
5/30 https://arxiv.org/abs/2512.23464
6/30 https://arxiv.org/abs/2512.24601
7/30 https://arxiv.org/abs/2512.23675
8/30 https://arxiv.org/abs/2512.16093
9/30 https://arxiv.org/abs/2512.18564
10/30 https://arxiv.org/abs/2512.16676
11/30 https://arxiv.org/abs/2601.09012
12/30 https://arxiv.org/abs/2512.21185
13/30 https://arxiv.org/abs/2601.06943
14/30 https://arxiv.org/abs/2601.05242
15/30 https://arxiv.org/abs/2512.16776
16/30 https://arxiv.org/abs/2601.06521
17/30 https://arxiv.org/abs/2512.24617
18/30 https://arxiv.org/abs/2512.16602
19/30 https://arxiv.org/abs/2601.05432
20/30 https://arxiv.org/abs/2512.20578
21/30 https://arxiv.org/abs/2601.10477
22/30 https://arxiv.org/abs/2512.24618
23/30 https://arxiv.org/abs/2601.00393
24/30 https://arxiv.org/abs/2601.09668
25/30 https://arxiv.org/abs/2601.03233
26/30 https://arxiv.org/abs/2512.17373
27/30 https://arxiv.org/abs/2601.06851
28/30 https://arxiv.org/abs/2512.16301
29/30 https://arxiv.org/abs/2512.16969
30/30 https://arxiv.org/abs/2512.24615

January 17, 2026 at 12:06 AM

Paper

@paper.bsky.social

[21/30] 138 Likes, 1 Comments, 1 Posts
2601.10477, cs․CV | cs․AI | cs․CY, 15 Jan 2026

🆕Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li

As hubs of human activity, urban surfaces consist of a wealth of semantic entities.

Segmenting these various entities from satellite imagery is crucial for a range of downstream applications.

Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks).

In this work, we achieve socio-semantic segmentation by vision-language model reasoning.

To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure.

Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning.

We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model.

Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization.

Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.

January 17, 2026 at 12:06 AM

Paper

@paper.bsky.social

[24/30] 129 Likes, 4 Comments, 1 Posts
2601.09668, cs․CV, 15 Jan 2026

🆕STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen...

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

January 17, 2026 at 12:05 AM

Paper

@paper.bsky.social

Top 30 most popular arXiv papers in the last 30 days.
[1/30] [2/30] [3/30] [4/30] [5/30] [6/30] [7/30] [8/30] [9/30] [10/30] [11/30] [12/30] [13/30] [14/30] [15/30] [16/30] [17/30] [18/30] [19/30] [20/30] [21/30] [22/30] [23/30] [24/30] [25/30] [26/30] [27/30] [28/30] [29/30] [30/30]

1/30 https://arxiv.org/abs/2512.16705
2/30 https://arxiv.org/abs/2512.24880
3/30 https://arxiv.org/abs/2512.15603
4/30 https://arxiv.org/abs/2601.07222
5/30 https://arxiv.org/abs/2512.18552
6/30 https://arxiv.org/abs/2512.23464
7/30 https://arxiv.org/abs/2512.24601
8/30 https://arxiv.org/abs/2512.16093
9/30 https://arxiv.org/abs/2512.23675
10/30 https://arxiv.org/abs/2512.18564
11/30 https://arxiv.org/abs/2512.16676
12/30 https://arxiv.org/abs/2512.21185
13/30 https://arxiv.org/abs/2601.06943
14/30 https://arxiv.org/abs/2601.05242
15/30 https://arxiv.org/abs/2512.16776
16/30 https://arxiv.org/abs/2601.06521
17/30 https://arxiv.org/abs/2512.24617
18/30 https://arxiv.org/abs/2512.20578
19/30 https://arxiv.org/abs/2512.16602
20/30 https://arxiv.org/abs/2601.05432
21/30 https://arxiv.org/abs/2512.24618
22/30 https://arxiv.org/abs/2601.00393
23/30 https://arxiv.org/abs/2512.15431
24/30 https://arxiv.org/abs/2601.09012
25/30 https://arxiv.org/abs/2512.17373
26/30 https://arxiv.org/abs/2601.03233
27/30 https://arxiv.org/abs/2512.16301
28/30 https://arxiv.org/abs/2512.16969
29/30 https://arxiv.org/abs/2512.24615
30/30 https://arxiv.org/abs/2601.06851

January 16, 2026 at 12:07 AM

Paper

@paper.bsky.social

[9/30] 356 Likes, 67 Comments, 4 Posts
2512.23675, cs․LG, 31 Dec 2025

🆕End-to-End Test-Time Training for Long Context

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, ...

We formulate long-context language modeling as a problem in continual learning rather than architecture design.

Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention.

However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights.

In addition, we improve the model's initialization for learning at test time via meta-learning at training time.

Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms.

We conduct extensive experiments with a focus on scaling properties.

In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not.

However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context.

Our code is publicly available.
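
The abstract names sliding-window attention as the base architecture and notes constant inference latency regardless of context length. As a minimal sketch of the attention pattern only (not the authors' code, and not the test-time training step itself), the mask below lets each query position see itself and a bounded number of earlier positions. The function name and the window size of 3 are hypothetical choices for illustration.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Causal sliding window: position i sees itself and at most (window - 1)
    positions immediately before it. `window` is an illustrative parameter,
    not a value taken from the paper.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)

# Number of keys each query attends to; it stops growing once i >= window - 1.
keys_per_query = [sum(row) for row in mask]

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the number of attended keys per query is capped by the window, per-token attention cost in this pattern does not grow with context length, which is consistent with the constant-latency property described above.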

January 16, 2026 at 12:07 AM

Paper

@paper.bsky.social

[24/30] 118 Likes, 40 Comments, 2 Posts
2601.09012, cs․CL | cs․AI, 13 Jan 2026

🆕TranslateGemma Technical Report

Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, ...

We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models.

To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process.

First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data.

This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM.

We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs.

Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes.

Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency.

We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark.

The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.

January 16, 2026 at 12:07 AM

Paper

@paper.bsky.social

[26/30] 118 Likes, 3 Comments, 1 Posts
2601.03233, cs․CV, 06 Jan 2026

🆕LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richards...

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.

January 16, 2026 at 12:07 AM

Paper

@paper.bsky.social

[30/30] 111 Likes, 66 Comments, 1 Posts
2601.06851, cs․AI, 11 Jan 2026

🆕A Brain-like Synergistic Core in LLMs Drives Behaviour and Learning

Pedro Urbina-Rodriguez, Zafeirios Fountas, Fernando E. Rosas, Jun Wang, Andrea I. Luppi, Haitham Bou-Ammar, Murray Shanahan, Pedro A. M. Mediano

The independent evolution of intelligence in biological and artificial systems offers a unique opportunity to identify its fundamental computational principles.

Here we show that large language models spontaneously develop synergistic cores -- components where information integration exceeds individual parts -- remarkably similar to those in the human brain.

Using principles of information decomposition across multiple LLM model families and architectures, we find that areas in middle layers exhibit synergistic processing while early and late layers rely on redundancy, mirroring the informational organisation in biological brains.

This organisation emerges through learning and is absent in randomly initialised networks.

Crucially, ablating synergistic components causes disproportionate behavioural changes and performance loss, aligning with theoretical predictions about the fragility of synergy.

Moreover, fine-tuning synergistic regions through reinforcement learning yields significantly greater performance gains than training redundant components, yet supervised fine-tuning shows no such advantage.

This convergence suggests that synergistic information processing is a fundamental property of intelligence, providing targets for principled model design and testable predictions for biological intelligence.

January 16, 2026 at 12:06 AM

Paper

@paper.bsky.social

Top 30 most popular arXiv papers in the last 30 days.
[1/30] [2/30] [3/30] [4/30] [5/30] [6/30] [7/30] [8/30] [9/30] [10/30] [11/30] [12/30] [13/30] [14/30] [15/30] [16/30] [17/30] [18/30] [19/30] [20/30] [21/30] [22/30] [23/30] [24/30] [25/30] [26/30] [27/30] [28/30] [29/30] [30/30]

1/30 https://arxiv.org/abs/2512.16705
2/30 https://arxiv.org/abs/2512.24880
3/30 https://arxiv.org/abs/2512.14575
4/30 https://arxiv.org/abs/2512.15603
5/30 https://arxiv.org/abs/2512.18552
6/30 https://arxiv.org/abs/2512.23464
7/30 https://arxiv.org/abs/2512.24601
8/30 https://arxiv.org/abs/2601.07222
9/30 https://arxiv.org/abs/2512.16093
10/30 https://arxiv.org/abs/2512.14693
11/30 https://arxiv.org/abs/2512.14012
12/30 https://arxiv.org/abs/2512.18564
13/30 https://arxiv.org/abs/2512.16676
14/30 https://arxiv.org/abs/2512.21185
15/30 https://arxiv.org/abs/2601.06943
16/30 https://arxiv.org/abs/2512.16776
17/30 https://arxiv.org/abs/2601.05242
18/30 https://arxiv.org/abs/2601.06521
19/30 https://arxiv.org/abs/2512.24617
20/30 https://arxiv.org/abs/2512.16602
21/30 https://arxiv.org/abs/2512.20578
22/30 https://arxiv.org/abs/2601.05432
23/30 https://arxiv.org/abs/2512.24618
24/30 https://arxiv.org/abs/2601.00393
25/30 https://arxiv.org/abs/2512.15431
26/30 https://arxiv.org/abs/2512.14691
27/30 https://arxiv.org/abs/2512.17373
28/30 https://arxiv.org/abs/2512.16301
29/30 https://arxiv.org/abs/2512.16969
30/30 https://arxiv.org/abs/2512.24615

January 15, 2026 at 12:06 AM

Paper

@paper.bsky.social

[8/30] 449 Likes, 74 Comments, 3 Posts
2601.07222, math․AG | math․AT, 12 Jan 2026

🆕The motivic class of the space of genus $0$ maps to the flag variety

Jim Bryan, Balázs Elek, Freddie Manners, George Salafatinos, Ravi Vakil

Let $\operatorname{Fl}_{n+1}$ be the variety of complete flags in $\mathbb{A}^{n+1}$ and let $\Omega^{2}_{\beta}(\operatorname{Fl}_{n+1})$ be the space of based maps $f:\mathbb{P}^{1}\to \operatorname{Fl}_{n+1}$ in the class $f_{*}[\mathbb{P}^{1}]=\beta$. We show that under a mild positivity condition on $\beta$, the class of $\Omega^{2}_{\beta}(\operatorname{Fl}_{n+1})$ in $K_{0}(\operatorname{Var})$, the Grothendieck group of varieties, is given by \[ [\Omega^{2}_{\beta}(\operatorname{Fl}_{n+1})] = [\operatorname{GL}_{n}\times \mathbb{A}^{a}]. \] The proof of this result was obtained in conjunction with Google Gemini and related tools. We briefly discuss this research interaction, which may be of independent interest. However, the treatment in this paper is entirely human-authored (aside from excerpts in an appendix which are clearly marked as such).

January 15, 2026 at 12:06 AM

Paper

@paper.bsky.social

[15/30] 171 Likes, 4 Comments, 1 Posts
2601.06943, cs․CV | cs․AI, 11 Jan 2026

🆕Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Kunyi Wang, Rui Xu, Sen Hu, Ji...

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification.

To bridge this gap, we construct the first video deep research benchmark, VideoDR.

VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains.

We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains.

Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks.

In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

January 14, 2026 at 12:06 AM

Paper

@paper.bsky.social

[20/30] 146 Likes, 3 Comments, 1 Posts
2601.06521, cs․CV | cs․CL, 10 Jan 2026

🆕BabyVision: Visual Reasoning Beyond Language

Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang She...

While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding.

We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly.

To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs.

BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories.

Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines.

Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1.

These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives.

Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities.

We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit.

Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.

January 14, 2026 at 12:06 AM

Paper

@paper.bsky.social

[22/30] 129 Likes, 3 Comments, 1 Posts
2601.05432, cs․CV | cs․AI | cs․CL, 08 Jan 2026

🆕Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu

The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans -- using maps. In this work, we first equip the model with a Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
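
The abstract reports Acc@500m without defining it. A minimal sketch of one common reading follows, assuming the metric is the fraction of predictions whose great-circle distance to the ground truth is at most 500 m. The function names, the Earth-radius constant, and the sample coordinates are illustrative assumptions, not taken from the paper or from MAPBench.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(preds, truths, threshold_m=500.0):
    """Fraction of predictions within `threshold_m` metres of the ground truth."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and of equal length")
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(preds)


# Made-up coordinates: the first guess is a few hundred metres off,
# the second is several kilometres off.
truths = [(48.8584, 2.2945), (40.6892, -74.0445)]
preds = [(48.8600, 2.2950), (40.7128, -74.0060)]
print(accuracy_within(preds, truths, threshold_m=500.0))
```

Under this reading, a reported gain such as the one above means a larger share of test images were localized to within 500 m; the benchmark's exact distance computation may differ.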

January 13, 2026 at 12:05 AM

Paper

@paper.bsky.social

[24/30] 118 Likes, 6 Comments, 1 Posts
2601.05242, cs․CL | cs․AI | cs․LG, 08 Jan 2026

🆕GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Ch...

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios.

To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors.

However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability.

In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure.

We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability.

We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length).

Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
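
The collapse described above can be illustrated with a toy calculation. This is a minimal sketch based only on the abstract, assuming group-wise mean/standard-deviation normalization; the function names, the epsilon, and the two-rollout groups are hypothetical, and the paper's full method may include steps not stated in the abstract.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # avoids division by zero when every rollout gets the same reward


def normalize(values):
    """Group-wise normalization: subtract the mean, divide by the std."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def coupled_advantages(group):
    """GRPO-style: sum each rollout's rewards first, then normalize the totals."""
    return normalize([sum(rewards) for rewards in group])


def decoupled_advantages(group):
    """Decoupled: normalize each reward across the group, then sum per rollout."""
    per_reward = [normalize(column) for column in zip(*group)]
    return [sum(parts) for parts in zip(*per_reward)]


# Each group holds two rollouts scored by two binary rewards (made-up values).
group_a = [(0.0, 0.0), (1.0, 0.0)]  # second rollout satisfies one reward
group_b = [(0.0, 0.0), (1.0, 1.0)]  # second rollout satisfies both rewards

for name, group in (("a", group_a), ("b", group_b)):
    print(name, coupled_advantages(group), decoupled_advantages(group))
```

In this toy case the coupled variant assigns both groups the same advantages, so satisfying one reward and satisfying both look identical to the learner, while the decoupled variant keeps the two cases apart.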

January 11, 2026 at 12:06 AM

Paper

@paper.bsky.social

[26/30] 111 Likes, 9 Comments, 1 Posts
2512.16969, cs․AI | cs․CL | cs․LG, 18 Dec 2025

🆕Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuan...

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI) -- the ability to autonomously conceive, investigate, and reason across scientific domains -- remains lacking.

We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.

SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs.

Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.

We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers.

Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.

January 11, 2026 at 12:06 AM

Paper

@paper.bsky.social

[28/30] 108 Likes, 27 Comments, 4 Posts
2512.16923, cs․CV, 07 Jan 2026

🆕Generative Refocusing: Flexible Defocus Control from a Single Image

Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu

Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment.

Single-image refocusing is still difficult.

It involves recovering sharp content and creating realistic bokeh.

Current methods have significant drawbacks.

They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture.

We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh.

Our main innovation is semi-supervised training.

This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide.

Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks.

Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.

January 11, 2026 at 12:06 AM

Paper

@paper.bsky.social

Top 30 most popular arXiv papers in the last 30 days.
[1/30] [2/30] [3/30] [4/30] [5/30] [6/30] [7/30] [8/30] [9/30] [10/30] [11/30] [12/30] [13/30] [14/30] [15/30] [16/30] [17/30] [18/30] [19/30] [20/30] [21/30] [22/30] [23/30] [24/30] [25/30] [26/30] [27/30] [28/30] [29/30] [30/30]

1/30 https://arxiv.org/abs/2512.16705
2/30 https://arxiv.org/abs/2512.24880
3/30 https://arxiv.org/abs/2512.10685
4/30 https://arxiv.org/abs/2512.14575
5/30 https://arxiv.org/abs/2512.15603
6/30 https://arxiv.org/abs/2512.18552
7/30 https://arxiv.org/abs/2512.23464
8/30 https://arxiv.org/abs/2512.24601
9/30 https://arxiv.org/abs/2512.16093
10/30 https://arxiv.org/abs/2512.14693
11/30 https://arxiv.org/abs/2512.14012
12/30 https://arxiv.org/abs/2512.18564
13/30 https://arxiv.org/abs/2512.21185
14/30 https://arxiv.org/abs/2512.13604
15/30 https://arxiv.org/abs/2512.24617
16/30 https://arxiv.org/abs/2512.11146
17/30 https://arxiv.org/abs/2512.16602
18/30 https://arxiv.org/abs/2512.13564
19/30 https://arxiv.org/abs/2512.15431
20/30 https://arxiv.org/abs/2512.20578
21/30 https://arxiv.org/abs/2512.24618
22/30 https://arxiv.org/abs/2512.14691
23/30 https://arxiv.org/abs/2512.17373
24/30 https://arxiv.org/abs/2512.10430
25/30 https://arxiv.org/abs/2512.17220
26/30 https://arxiv.org/abs/2601.00393
27/30 https://arxiv.org/abs/2512.14699
28/30 https://arxiv.org/abs/2512.24615
29/30 https://arxiv.org/abs/2512.12967
30/30 https://arxiv.org/abs/2512.13687

January 10, 2026 at 12:07 AM

Paper

@paper.bsky.social

[25/30] 110 Likes, 3 Comments, 1 Posts
2512.17220, cs․CL, 19 Dec 2025

🆕Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu

Humans understand long and complex texts by relying on a holistic semantic representation of the content.

This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology.

Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks.

In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness.

MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation.

This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context.

We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making.

It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

January 10, 2026 at 12:06 AM

Paper

@paper.bsky.social

[26/30] 106 Likes, 4 Comments, 1 Posts
2601.00393, cs․CV, 01 Jan 2026

🆕NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang

In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications.

We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing.

In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos.

Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques.

These designs empower NeoVerse with versatility and generalization to various domains.

Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.

Our project page is available at https://neoverse-4d.github.io

January 10, 2026 at 12:06 AM

Paper

@paper.bsky.social

[28/30] 104 Likes, 4 Comments, 1 Posts
2512.24615, cs․AI, 31 Dec 2025

🆕Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang L...

Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose Youtu-Agent, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a Workflow mode for standard tasks and a Meta-Agent mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an Agent Practice module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an Agent RL module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.

January 10, 2026 at 12:06 AM

Paper

@paper.bsky.social

[29/30] 103 Likes, 5 Comments, 1 Posts
2512.12967, cs․CL, 15 Dec 2025

🆕QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng ...

We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations.

The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence.

By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities.

(2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs.

(3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens.

Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average.

On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline.

Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.

January 10, 2026 at 12:06 AM
